Abstract
The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. Historically, however, GWAS have been limited by inadequate sample sizes because of the costs of genotyping and phenotyping study subjects. This has prompted several academic medical centers to form “biobanks” where biospecimens linked to personal health information, typically in electronic health records (EHRs), are collected and stored for large numbers of subjects. These biobanks provide tremendous opportunities to discover novel genotype-phenotype associations and foster hypothesis generation. In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical and genotype data stored at the Mayo Clinic Biobank to mine the phenotype data for genetic associations. In particular, we demonstrate the use of the Resource Description Framework (RDF) for representing EHR diagnoses and procedure data, and enable federated querying via standardized Web protocols to identify genotyped subjects with Type 2 Diabetes for discovering gene-disease associations. Our study highlights the potential of Web-scale data federation techniques to execute complex queries.
Introduction
In the past decade, there has been a plethora of discoveries in genomic sciences involving complex, non-Mendelian diseases that relate single-nucleotide polymorphisms (SNPs) to clinical conditions and measurable traits. This has become feasible due to the advances in high-throughput genotyping technologies and genome-wide association studies (GWAS) that allow studying the entire human genome in thousands of unrelated individuals regarding genetic associations with different diseases. However, unlike Mendelian traits, effect sizes of genetic variants associated with common diseases are relatively small and thus large sample sizes are required for discovery.
To address this research need, several academic medical centers are forming biorepositories or biobanks that collect and store individual biospecimens from which DNA for conducting genetic research can be extracted. Additionally, these biobanks are often linked to electronic health records (EHR) that support retrieval and querying of vast amounts of phenotype data. The Electronic Medical Records and Genomics (eMERGE1) consortium (a network of seven academic medical centers, of which Mayo Clinic is a member) has demonstrated the applicability of “EHR-derived phenotyping algorithms” for cohort identification to conduct GWAS for several diseases, including peripheral arterial disease2 and atrioventricular conduction3. A common thread across the library of algorithms4 is access to different types and modalities of clinical data for algorithm execution, which includes billing and diagnoses information, laboratory measurements, patient procedure encounters, medication and prescription management data, and co-morbidities (e.g., smoking history, socio-economic status). While these approaches with EHR-linked biorepositories have successfully facilitated GWAS, such studies typically focus on a narrow phenotypic domain, such as the presence or absence of a given disease, and ignore the potential power that can be gained through intermediate and sub-phenotypes, as well as by considering pleiotropic associations. Furthermore, most existing GWAS results are based on populations of European descent, thereby limiting the understanding of the genetic contribution to diseases and traits in other racial and ethnic populations. To this end, there has been an emerging interest in mining the human phenome via a “reverse GWAS” or a PheWAS (Phenome Wide Association Scans): that is, identifying, for a given genotype, the set of associated clinical phenotypes.
By using clinical data from EHRs, a PheWAS allows systematic study of associations between a number of common genetic variations and a large variety of clinical phenotypes. Recent studies by Denny et al.5 and Pendergrass et al.6 demonstrated the potential for PheWAS to replicate previously published genotype-phenotype associations, as well as to identify novel associations using patient EHR data. However, to extract phenotype data from EHRs, one faces the challenge of representing and integrating data in a form that allows federated querying, reasoning, and efficient information retrieval across multiple sources of clinical data and information.
The work proposed in this study is an attempt to address this challenge by exploring and experimenting with Semantic Web technologies for enabling a PheWAS. A key aspect of the Semantic Web is a rigorous mechanism for defining and linking heterogeneous data using Web protocols and a simple data model called Resource Description Framework (RDF). By representing data as labeled graphs, RDF provides a powerful framework for expressing and integrating any type of data. As of March 2012, under the auspices of an initiative called the Linked Open Data (LOD7), more than 250 public datasets from multiple domains (e.g., gene and disease relationships, drugs and side effects) are available in RDF, and have been integrated by specifying approximately 350 million links between the RDF graphs. Not only do such efforts provide tremendous opportunities to devise novel approaches for combining private, institution-specific EHR data with public knowledgebases for phenotyping, but they also present several challenges in representing EHR data using RDF, creating linkages between multiple disparate RDF graphs, and developing mechanisms for executing federated queries that analyze information spanning genes, proteins, pathways, diseases, drugs, and adverse events.
In this paper, we describe our efforts in representing real patient data, both clinical and genomic, from Mayo Clinic EHR systems and the biobank, respectively, as RDF graphs. In particular, we leverage open-source tooling and infrastructure developed by the Semantic Web community to extract phenotype and genotype information on subjects with Type 2 Diabetes Mellitus (T2DM), and conduct a phenome-wide scan to discover new genetic associations as well as replicate existing ones. As a proof of concept, we present our results for three SNPs associated with T2DM within an EHR population at the Mayo Clinic Biobank. Our approach highlights the potential of using Semantic Web technologies for exploring a large and varied range of clinical phenotypes derived from EHRs for genomics research in a high-throughput manner.
Background
Mayo Clinic Biobank and the Genome Consortia
The Mayo Clinic Biobank is an institutional resource for biological specimens, patient provided risk factor data, and clinical data on patients recruited to the Biobank. Operational since 2009, the biobank has enrolled more than 22,000 subjects in an effort to support a wide array of health-related research studies throughout the institution. Study participants provide a blood sample for DNA and serum/plasma research, complete a health risk questionnaire, allow access to medical records, and consent to prospective follow-up for health outcomes. The Mayo Genome Consortia (MayoGC8) is a large cohort of Mayo Clinic patients with clinical data (linked via their EHRs) and genotype data. Formed as a voluntary collaboration of investigators across disciplines at Mayo Clinic, eligible participants in MayoGC include those who gave general research (i.e., not disease-specific) consent to share high-throughput genotyping data with other investigators. The MayoGC cohort is being built in 2 phases. Phase I, which has been completed, includes participants from 3 studies funded by the NIH which sought to identify genetic determinants of peripheral arterial disease (PAD), venous thromboembolism, and pancreatic cancer, respectively, with a combined total sample size of 6307 unique participants (Table 1). Phase 2 is currently underway with the goal of expanding the MayoGC by recruiting eligible patients from other studies funded by the NIH at Mayo Clinic. For this study, we extracted clinical and genotype data on all 6307 subjects from Phase I of MayoGC (Table 1).
Table 1.
MayoGC Phase I studies (used with permission from Bielinski et al.8)
| Characteristics | eMERGE Network (PAD)2, Cases (n=1612) | eMERGE Network (PAD)2, Controls (n=1585) | GENEVA (VTE)6, Cases (n=1233) | GENEVA (VTE)6, Controls (n=1264) | PANC7,8, Controls (n=613) |
|---|---|---|---|---|---|
| Age (y), mean ± SD | 66.0±10.7 | 61.0±7.4 | 55.0±16.2 | 56.0±15.8 | 66.0±10.0 |
| Female (%) | 36 | 40 | 50 | 52 | 45 |
| Medical record length (y) | |||||
| Mean ± SD | 23.4±20.0 | 26.1±20.3 | 13.7±16.3 | 21.1±15.4 | 30.2±16.5 |
| Median (range) | 18.7 (1.0–78.6) | 23.0 (1.0–79.2) | 6.3 (1.0–71.8) | 17.8 (1.0–70.2) | 29.8 (1.0–75.0) |
| White (%) | 94 | 94 | 96 | 99 | 100 |
| Geographic location, No. (%)c | |||||
| Olmsted County | 328 (20) | 590 (37) | 7 (1) | 10 (1) | 64 (10) |
| Southeast Minnesota | 191 (12) | 62 (4) | 205 (17) | 378 (30) | 107 (17) |
| Greater Minnesota | 393 (24) | 343 (22) | 314 (25) | 317 (25) | 135 (22) |
| Iowa | 212 (13) | 97 (6) | 176 (14) | 191 (15) | 65 (11) |
| South and North Dakota | 50 (3) | 31 (2) | 79 (6) | 71 (6) | 19 (3) |
| Wisconsin | 128 (8) | 68 (4) | 121 (10) | 138 (11) | 32 (5) |
| Other states or international | 309 (19) | 394 (25) | 330 (27) | 159 (13) | 191 (31) |
eMERGE = Electronic Medical Records and Genomics; GENEVA = Gene Environment Association Studies; MayoGC = Mayo Genome Consortia; PAD = peripheral arterial disease; PANC = Mayo Clinic Molecular Epidemiology of Pancreatic Cancer Study; VTE = venous thromboembolism.
Percentages may not total 100% because of rounding.
Southeast Minnesota includes 7 counties in the southeast corner of Minnesota: Dodge, Goodhue, Wabasha, Winona, Houston, Fillmore, and Mower. Olmsted County, Minnesota, is a mutually exclusive category.
Genetics of Type 2 Diabetes Mellitus
The prevalence of T2DM has been increasing rapidly in recent years, with one projection9 estimating that 438 million adults will have diabetes by the year 2030. While numerous non-genetic factors contribute to the development of diabetes, recent studies indicate the importance of genetic findings for the pathophysiology, prediction, and treatment of T2DM10. Furthermore, association studies focusing on quantitative traits such as fasting glucose, fasting insulin, and glycated hemoglobin A1C (HbA1c) have shed further light on the genetic susceptibility to T2DM. To date, at least 36 gene loci have been identified that contribute to the genetic risk of T2DM, although this number is expected to increase as larger cohorts are assembled. In particular, current estimates indicate that the gene loci associated with T2DM explain only approximately 10% of the disease heritability. This raises the challenge of finding the remaining heritability and of identifying additional diabetes-related gene loci, which can be expected to lead to clinically relevant disease prediction models. While a detailed discussion of the genetics of T2DM is beyond the scope of this paper (interested readers can refer to Herder et al.10), Table 2 below lists some of the gene loci and SNPs that are associated with T2DM or related traits.
Table 2.
Examples of gene loci associated with T2DM and related traits
| Gene Locus | Full Gene Name | SNP | Associated phenotype | Odds Ratio (95% CI) | p-Value | Reference |
|---|---|---|---|---|---|---|
| PPARG | Peroxisome proliferator-activated receptor gamma | rs1801282 | T2DM | 1.14 (1.08–1.20) | 1.7 × 10⁻⁶ | Scott et al.11 |
| KCNJ11 | Potassium inwardly rectifying channel, subfamily J, member 11 | rs5219 | T2DM | 1.14 (1.10–1.19) | 6.7 × 10⁻¹¹ | Scott et al.11 |
| TCF7L2 | Transcription factor 7-like 2 | rs7903146, rs12255372 | T2DM, glucose, HbA1c | 1.37 (1.31–1.43) | 1.0 × 10⁻⁴⁸ | Sladek et al.12 |
| SLC30A8 | Solute carrier family 30 [zinc transporter], member 8 | rs13266634 | T2DM, HbA1c | 1.12 (1.07–1.16) | 5.3 × 10⁻⁸ | Zeggini et al.13 |
| FTO | Fat mass and obesity associated | rs8050136 | T2DM, BMI | 1.17 (1.12–1.22) | 1.3 × 10⁻¹² | Scott et al.11 |
Semantic Web and related technologies
A key benefit of Semantic Web technologies is that they provide a rigorous mechanism for defining and linking data using Web protocols, such that the data can be used by machines not just for display, but also for automation, integration, and reuse across various applications. Specifically, an “attractive” element of the Semantic Web is its simple data model, RDF, which represents data as a labeled graph connecting resources and their property values with labeled edges representing properties. The graph can be structurally parsed into a set of triples (subject, predicate, object), making it very general and easy to express any type of data. Such a model, coupled with (i) dereferenceable Uniform Resource Identifiers (URIs) for creating globally unique names, and (ii) standard languages such as RDFS, OWL, and SPARQL for creating ontologies as well as modeling and querying data, provides a very powerful framework for heterogeneous data integration.
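To make the triple model concrete, the following is a minimal Python sketch, not the system described in this paper. It stores a graph as a set of (subject, predicate, object) tuples and matches a single triple pattern against it. The `http://example.org/` URIs, the predicate names, and the `match` helper are all hypothetical and chosen only for illustration.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; real deployments would use dereferenceable URIs.
EX = "http://example.org/"

# A tiny RDF-like graph: each element is a (subject, predicate, object) triple.
graph: Set[Triple] = {
    (EX + "patient/1", EX + "hasDiagnosis", EX + "diagnosis/T2DM"),
    (EX + "patient/1", EX + "hasGenotype", EX + "genotype/rs5219_CC"),
    (EX + "patient/2", EX + "hasDiagnosis", EX + "diagnosis/T2DM"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# Subjects that have a T2DM diagnosis, in sorted order.
t2dm_patients = sorted(
    s for s, _, _ in match(graph, p=EX + "hasDiagnosis", o=EX + "diagnosis/T2DM")
)

# Integrating a second graph is a set union, because names are global.
other_graph: Set[Triple] = {
    (EX + "diagnosis/T2DM", EX + "label", "Type 2 Diabetes Mellitus"),
}
merged = graph | other_graph
```

Because every node and edge label is a globally unique name, merging two graphs in this sketch needs no schema alignment step, which is the property the paragraph above attributes to URIs.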
While most clinical and research data is typically stored using relational databases (e.g., Oracle, MySQL) and queried using Structured Query Language (SQL), such technologies have several inherent limitations compared to RDF: (i) First, when database schemas are changed in a relational model, the whole repository, table structure, index keys etc. have to be reorganized, a task that can be quite complex and time-consuming. RDF, on the other hand, does not distinguish between schema (i.e., ontology classes and properties) and data (i.e., instances of the ontology classes) changes; both are merely additions or deletions of RDF triples, making such a model very nimble and flexible for updates. (ii) Second, RDF resources are identified by (globally) unique URIs, thereby allowing anyone to add additional information about the resource. For example, via RDF links, it is possible to create references between two different RDF graphs, even in completely different namespaces, enabling much easier data linkage and integration. This is rather difficult to achieve in the classical relational database paradigm. (iii) Third, a relational data model lacks any inherent notion of a hierarchy. For instance, simply because a particular drug is an angiotensin receptor blocker (ARB), a typical SQL query engine (without any ad-hoc workarounds) cannot reason that it belongs to a class of anti-hypertensive drugs. Such queries are natively supported in RDFS and OWL. (iv) Finally, due to the lack of a formal temporal model for representing relational data, SQL provides minimal native support for temporal queries. Such extensions are already in place for SPARQL14.
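The hierarchy point in (iii) can be illustrated with a short Python sketch of subsumption reasoning over subclass links, in the spirit of `rdfs:subClassOf`. This is an assumption-laden toy rather than an RDFS or OWL reasoner: the class names and the `superclasses` and `is_a` helpers are hypothetical, and losartan is used only as a familiar example of an ARB.

```python
from typing import Dict, Set

# Hypothetical subclass assertions (child -> set of direct parents).
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Losartan": {"AngiotensinReceptorBlocker"},
    "AngiotensinReceptorBlocker": {"AntihypertensiveDrug"},
    "AntihypertensiveDrug": {"Drug"},
}


def superclasses(cls: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Return all direct and inferred superclasses of cls (transitive closure)."""
    found: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in found:  # guards against cycles and repeated work
                found.add(parent)
                stack.append(parent)
    return found


def is_a(cls: str, target: str, subclass_of: Dict[str, Set[str]]) -> bool:
    """True if cls is target or is subsumed by target."""
    return cls == target or target in superclasses(cls, subclass_of)


# The ARB-to-antihypertensive inference that plain SQL would not make on its own.
losartan_is_antihypertensive = is_a("Losartan", "AntihypertensiveDrug", SUBCLASS_OF)
```

Adding a new drug class here is just one more entry in the mapping, which mirrors point (i): a schema change and a data change are the same kind of update.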
In summary, linked data and its enabling technologies, such as RDF, provide a robust, flexible, and scalable model for integrating and querying data, thereby warranting investigation of how such technologies can be applied in a clinical and translational research environment. However, while such a large network of integrated datasets provides exciting opportunities to execute expressive federated queries and to integrate and analyze information spanning genes, proteins, pathways, diseases, drugs, and adverse events, several questions remain unanswered about its applicability to high-throughput phenotyping of patient data in EHR systems. In the remainder of this paper, we provide a brief overview of RDF and SPARQL (the building blocks for linked data), and then propose our methods and present preliminary results in using Semantic Web technologies for EHR-derived mining of phenotype data for a PheWAS.
RDF and SPARQL
RDF15 is a World Wide Web Consortium (W3C) standardized data model for representing Semantic Web resources. It uses graphs to represent information using a triple-based notation comprising a subject, a predicate, and an object. All these entities can be uniquely identified by Internationalized Resource Identifiers (IRIs). SPARQL16 is a W3C recommended standard for querying RDF data. Similar in spirit to SQL, a SPARQL query is composed of five parts (Figure 1): zero or more prefix declarations for abbreviating IRIs, zero or more FROM or FROM NAMED clauses stating what RDF graph(s) are being queried, a query result clause specifying what information to return from the query, a WHERE clause specifying what to query for in the underlying dataset, and zero or more query modifiers to slice, order, and otherwise rearrange the query results.
Figure 1.
SPARQL query components
SPARQL specifies four forms of query result clauses: SELECT, CONSTRUCT, ASK, and DESCRIBE. The SELECT clause returns a table of result values, CONSTRUCT returns an RDF graph, ASK returns a boolean true or false depending on whether or not the query pattern has any matches in the dataset, and DESCRIBE allows the server to return whatever RDF it wants that describes the given resource(s). The optional set of FROM or FROM NAMED clauses defines the dataset against which the query is executed. The WHERE clause is the core of any SPARQL query, and is specified in terms of triple patterns. Finally, the optional set of modifiers operates over the result set of the WHERE clause before generating the final query results.
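The five query parts described above can be sketched as a small Python helper that assembles a SPARQL query string. This is only an illustration: the `build_query` function, the `ex:` prefix, the graph IRI, and the predicate names are hypothetical and do not come from the system described in this paper. Note that in SPARQL's concrete syntax the FROM clauses are written after the result clause.

```python
from typing import Dict, Iterable


def build_query(
    prefixes: Dict[str, str],
    result_clause: str,
    where_patterns: Iterable[str],
    from_graphs: Iterable[str] = (),
    modifiers: Iterable[str] = (),
) -> str:
    """Assemble a SPARQL query from its five parts."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]  # 1. prefixes
    lines.append(result_clause)                       # 2. result clause (SELECT, ASK, ...)
    lines += [f"FROM <{g}>" for g in from_graphs]     # 3. dataset definition
    lines.append("WHERE {")                           # 4. triple patterns
    lines += [f"  {pattern} ." for pattern in where_patterns]
    lines.append("}")
    lines += list(modifiers)                          # 5. solution modifiers
    return "\n".join(lines)


# Hypothetical example: list patients and their diagnoses, ten at a time.
query = build_query(
    prefixes={"ex": "http://example.org/ontology/"},
    result_clause="SELECT ?patient ?diagnosis",
    from_graphs=["http://example.org/graph/mclss"],
    where_patterns=["?patient ex:hasDiagnosis ?diagnosis"],
    modifiers=["ORDER BY ?patient", "LIMIT 10"],
)
```

Swapping the result clause for `ASK` (and dropping the modifiers) would turn the same pattern into a yes/no question, which is the distinction between the four result forms.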
Methods
System Architecture: Representing Patient Records and MayoGC genotype data as RDF graphs
Figure 2 shows our proposed architecture for representing patient health records and genotype data from MayoGC using RDF, linked data, and related technologies. It comprises three main components: (1) data access and storage, (2) RDF virtualization and ontology mapping, and (3) a SPARQL-based querying interface. Here we provide a brief overview of these components; more details are described in our prior work17.
Figure 2.
System architecture for representing patient records and MayoGC data using RDF
Data Access and Storage
This component comprises the patient demographics, diagnoses, procedures, laboratory results, and free-text clinical and pathology notes generated during a clinical encounter as well as SNP genotype data for all the 6307 subjects from MayoGC (Table 1). For accessing the phenotype data, we leverage the Mayo Clinic Life Sciences System (MCLSS18) which is a rich clinical data repository maintained by the Enterprise Data Warehousing Section of the Department of Information Technology. MCLSS contains patient demographics, diagnoses, hospital, laboratory, flowsheet, clinical notes, and pathology data obtained from multiple clinical and hospital source systems within Mayo Clinic at Rochester, Minnesota. Data in MCLSS is accessed via the Data Discovery and Query Builder (DDQB) toolset, consisting of a web-based GUI application and programmatic API. Investigators, study staff, and data retrieval specialists can utilize DDQB and MCLSS to rapidly and efficiently search millions of patient records. Data found by DDQB can be exported into CSV, TAB or Microsoft® Excel files for portability. It implements full data authorization and audit logging to ensure data security standards are met.
Note that while DDQB provides graphical user and application programming interfaces for accessing the warehouse database, our goal is to represent the data stored in the MCLSS database as RDF. In particular, our goal is to create “virtual RDF graphs” that essentially wrap one or more relational databases into a virtual, read-only RDF graph. This allows us to access the content of large, live, non-RDF databases without having to replicate all the information into RDF. Consequently, for this study, we obtained appropriate approvals from Mayo’s Institutional Review Board (IRB) for accessing patient information in the MCLSS database using programmatic API and JDBC calls (see more details below). Similarly, for accessing the genotype data from MayoGC, we created virtual RDF graphs.
RDF Virtualization
The RDF virtualization component is based on the open-source Virtuoso Universal Server19, which acts as a mediator in the creation of virtual RDF graphs and also provides a SPARQL endpoint for querying the graphs. In particular, a declarative language is used to describe the mappings between the relational schema and RDFS/OWL ontologies to create the virtual RDF graphs. This language generates a mapping file from the table structures of the databases in MCLSS and MayoGC, which can then be customized by replacing the auto-generated terms with concepts from standardized ontologies. In our case, we modified the custom ontology generated by Virtuoso for creating these mappings with terms and concepts from standardized ontologies, including the Translational Medicine Ontology, SNOMED, and Gene Ontology.
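The idea of a declarative relational-to-RDF mapping can be sketched in a few lines of Python. This is not Virtuoso's mapping language and does not reproduce its syntax; the table name, column names, `http://example.org/` IRIs, and the `rows_to_triples` helper are hypothetical. The sketch only shows how a mapping description, rather than hand-written code per table, can turn rows into triples on demand.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/mclss/"  # hypothetical namespace

# Declarative description of how one (hypothetical) table maps to RDF.
DIAGNOSIS_MAPPING: Dict[str, Any] = {
    "table": "diagnosis",
    "subject_template": BASE + "diagnosis/{diagnosis_id}",
    "class": BASE + "ontology/Diagnosis",
    "columns": {
        "patient_id": BASE + "ontology/hasPatient",
        "icd9_code": BASE + "ontology/icd9Code",
    },
}


def rows_to_triples(
    rows: Iterable[Dict[str, Any]], mapping: Dict[str, Any]
) -> Iterator[Triple]:
    """Lazily yield triples for each row; nothing is materialized up front."""
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        yield (subject, RDF_TYPE, mapping["class"])
        for column, predicate in mapping["columns"].items():
            value = row.get(column)
            if value is not None:  # SQL NULLs produce no triple
                yield (subject, predicate, str(value))


# Stand-in for rows fetched over JDBC or a similar database connection.
rows = [
    {"diagnosis_id": 10, "patient_id": "P1", "icd9_code": "250.00"},
    {"diagnosis_id": 11, "patient_id": "P2", "icd9_code": None},
]
triples = list(rows_to_triples(rows, DIAGNOSIS_MAPPING))
```

Replacing an auto-generated predicate with a term from a standard ontology, as described above, corresponds here to editing one IRI in the mapping without touching the underlying table.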
SPARQL endpoint
The virtual RDF graphs created from MCLSS and MayoGC using the above approach were exposed via a SPARQL endpoint in the Virtuoso server. This allows software application clients to query the MCLSS and MayoGC RDF graphs using the SPARQL query language. Given that our overarching goal is to integrate the MCLSS and MayoGC RDF graphs, our objective is to execute federated queries across multiple SPARQL endpoints. We discuss the details of SPARQL-based federated querying in the next section.
SPARQL-based federated querying for T2DM phenotype and genotype data extraction
As shown in Figure 2, our goal is to federate between two main data sources: MCLSS and MayoGC, where the former is a DB2 database containing patient clinical and demographic data, and the latter is a MySQL database containing genomic information about patients who have volunteered their DNA information to be stored for medical research. Because participation in the Mayo Clinic Biobank (and hence, in the MayoGC project) is voluntary, the patients in the MayoGC database are a subset of those in MCLSS. In its current form, one would have to execute multiple SQL queries across both these databases (for example, to find the diagnoses for all patients who have a particular SNP genotype) and then combine their outputs to retrieve the appropriate result set. Instead, by creating RDF views for MCLSS and MayoGC, we demonstrate how this can be achieved using a single SPARQL query (Figure 3).
Figure 3.
Federated SPARQL for MCLSS and MayoGC
In particular, to achieve this goal, two endpoints were created and the SERVICE keyword was used to access each. In the first SERVICE stanza, the MayoGC SPARQL endpoint is being queried to provide the MayoGC Identification number for each patient along with their SNP identifiers (rsID) and genotypes. Since clinicId and patientId are primary keys in the relational database where the information is being stored, the FILTER part of the SERVICE stanza joins the two tables for the query. In the second SERVICE stanza, the MCLSS endpoint is being queried to provide the Mayo Clinic Identification number for the patient along with their diagnosis data. For the tables in MCLSS, the internalKey relationship joins the two tables where their internalKeys are equal. The final part of the federated query joins the information retrieved from the two endpoints. The first FILTER statement joins the MayoGC data with the MCLSS data based on the clinicNumber and the patientId. The final two FILTER statements limit the results for this query to only those patients who have SNP “rs5219” with genotype “C:C”.
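The shape of such a federated query, and the join it expresses, can be sketched as follows. This is not the query from Figure 3: the endpoint URLs, the `ex:` predicates, and the `federated_join` helper are hypothetical, and the internal table joins within each endpoint are omitted. The Python function simulates, on in-memory rows, what the cross-endpoint FILTER statements ask the query engine to do.

```python
from typing import Dict, List, Tuple

# Hypothetical federated query using the SPARQL 1.1 SERVICE keyword.
FEDERATED_QUERY = """
PREFIX ex: <http://example.org/ontology/>
SELECT ?clinicNumber ?diagnosis
WHERE {
  SERVICE <http://example.org/mayogc/sparql> {
    ?g ex:patientId ?patientId ;
       ex:rsId ?rsId ;
       ex:genotype ?genotype .
  }
  SERVICE <http://example.org/mclss/sparql> {
    ?d ex:clinicNumber ?clinicNumber ;
       ex:diagnosis ?diagnosis .
  }
  FILTER (?clinicNumber = ?patientId)
  FILTER (?rsId = "rs5219")
  FILTER (?genotype = "C:C")
}
"""


def federated_join(
    genotype_rows: List[Dict[str, str]],
    diagnosis_rows: List[Dict[str, str]],
    rs_id: str,
    genotype: str,
) -> List[Tuple[str, str]]:
    """Join genotype and diagnosis rows on patient identifier, then filter."""
    wanted = {
        row["patientId"]
        for row in genotype_rows
        if row["rsId"] == rs_id and row["genotype"] == genotype
    }
    return sorted(
        (row["clinicNumber"], row["diagnosis"])
        for row in diagnosis_rows
        if row["clinicNumber"] in wanted
    )


# Toy rows standing in for the two endpoints' results (fabricated for illustration).
genotypes = [
    {"patientId": "P1", "rsId": "rs5219", "genotype": "C:C"},
    {"patientId": "P2", "rsId": "rs5219", "genotype": "C:T"},
    {"patientId": "P3", "rsId": "rs7903146", "genotype": "C:C"},
]
diagnoses = [
    {"clinicNumber": "P1", "diagnosis": "250.00"},
    {"clinicNumber": "P1", "diagnosis": "272.4"},
    {"clinicNumber": "P2", "diagnosis": "250.00"},
    {"clinicNumber": "P3", "diagnosis": "700"},
]
result = federated_join(genotypes, diagnoses, "rs5219", "C:C")
```

In a real deployment the join would be evaluated by the SPARQL engine rather than in client code; the simulation is only meant to make the semantics of the three FILTER statements explicit.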
Results: Phenome-Wide Scan for Type 2 Diabetes Mellitus
For evaluating our approach, we first identified the case and control status for all MayoGC subjects by executing the T2DM phenotype criteria defined within the eMERGE consortia20. This step was followed by executing the federated SPARQL query illustrated above between the MCLSS and MayoGC endpoints to determine all the T2DM cases having the SNP genotypes (from Table 2), and retrieving the entire set of ICD-9-CM billing and diagnoses codes for each eligible subject. The reasoning behind using the billing and diagnoses codes was two-fold: (1) these codes are universally used within the US healthcare system, and thereby enable future implementation of our approach at other institutions, and (2) the disease, signs, and symptoms ICD-9-CM codes can be used as a surrogate for approximating the clinical disease phenotype. However, given that ICD-9-CM was primarily developed for billing and administrative applications and does not necessarily provide a well-defined, robust, and logical hierarchy for the codes, we used AHRQ’s Clinical Classification Software (CCS21) for clustering the billing and diagnoses data into a manageable number of clinically meaningful categories. In particular, CCS classifies over 14,000 diagnoses and 3,900 procedures from ICD-9-CM into 285 and 231 mutually exclusive diagnoses and procedure categories, respectively, each of which is assigned a unique identifier. This tool is continually updated by AHRQ, and the version used in this study is based on ICD-9-CM codes that are valid from January 1980 through September 2012.
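The clustering step can be sketched as a lookup from ICD-9-CM codes to CCS categories followed by a count of distinct subjects per category. The sketch below is an illustration under stated assumptions: the handful of code-to-category pairs is a hand-written stand-in that should be checked against the actual CCS release, the subject data is fabricated, and `subjects_per_category` is a hypothetical helper. The default threshold of 25 follows the exclusion of small clusters reported with Figure 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Illustrative stand-in for the CCS lookup table; a real analysis would load
# the mapping files distributed by AHRQ instead of hard-coding entries.
ICD9_TO_CCS: Dict[str, str] = {
    "250.00": "Diabetes mellitus without complication",
    "250.02": "Diabetes mellitus without complication",
    "272.4": "Disorders of lipid metabolism",
    "700": "Other skin disorders",
}


def subjects_per_category(
    codes_by_subject: Dict[str, Iterable[str]],
    icd9_to_ccs: Dict[str, str],
    min_subjects: int = 25,
) -> Dict[str, int]:
    """Count distinct subjects per CCS category, dropping small categories."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for subject, codes in codes_by_subject.items():
        for code in codes:
            category = icd9_to_ccs.get(code)
            if category is not None:  # codes missing from the lookup are skipped
                members[category].add(subject)
    return {
        category: len(subjects)
        for category, subjects in members.items()
        if len(subjects) >= min_subjects
    }


# Fabricated toy input: subject identifier -> ICD-9-CM codes on record.
toy_codes = {
    "P1": ["250.00", "250.02", "272.4"],
    "P2": ["250.00"],
    "P3": ["700", "999.99"],
}
counts_all = subjects_per_category(toy_codes, ICD9_TO_CCS, min_subjects=1)
counts_filtered = subjects_per_category(toy_codes, ICD9_TO_CCS, min_subjects=2)
```

Counting subjects rather than code occurrences means a patient with many diabetes-related codes contributes once to the diabetes category, which is the natural unit for a per-SNP phenotype comparison.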
Figure 4 shows the SNP-disease associations for four T2DM SNPs (see Table 2; ICD-9-CM diagnoses code clusters with fewer than 25 subjects are not included). Several observations are noteworthy. First, for all four SNPs, we observe a significant association with diabetes and related traits, such as disorders of lipid metabolism. This replicates a finding by Warodomwichit et al.22, where high (n-6) polyunsaturated fatty acid intakes were associated with atherogenic dyslipidemia in carriers of the minor T allele at the rs7903146 SNP in the TCF7L2 gene, which may predispose them to Metabolic Syndrome (MetS), diabetes, and cardiovascular disease. Second, while previous studies have positively associated the SNP rs12255372 (Figure 4(c)) with breast cancer23 and prostate cancer24, our findings did not replicate the same association. We believe that this lack of replication is an artifact of the small sample size studied in this work. Third, for all four SNPs, there was a significant association with skin and tissue related diseases (e.g., “Other Skin Disorders”) that included phenotypes such as systemic sclerosis, corns, and seborrhoeic dermatitis. However, further investigation of the literature did not reveal any existing studies where such an association was reported, and thus corroboration of this finding is needed to help rule out a false positive. Finally, since our analysis was done only on T2DM cases (identified via an existing eMERGE algorithm20) among a handful of MayoGC subjects, it is unknown at this time what SNP-disease association patterns will be observed when considering T2DM controls. We discuss all these aspects in the next section.
Figure 4.
SNP-disease associations for T2DM SNPs obtained via phenome mining: (a) SNP rs5219 within the gene KCNJ11; (b) SNP rs7903146 within the gene TCF7L2; (c) SNP rs12255372 within the gene TCF7L2; (d) SNP rs13266634 within the gene SLC30A8
Summary
Discussion and Interpretation of Results
Research in clinical and translational science demands effective and efficient methods for accessing, integrating, interpreting and analyzing data from multiple, distributed, and often heterogeneous data sources in a unified way. Traditionally, such a process of data collection and analysis is done manually by investigators and researchers, which is not only time consuming and cumbersome, but in many cases, error prone. The emerging Semantic Web tools and technologies allow exposing different modalities of data, including clinic, research, and scientific, as structured RDF that can be queried uniformly via SPARQL. Not only does this provide the capabilities for interlinking and federated querying of diverse Web-based resources, but also enables fusion of private/local and public data in very powerful ways.
The overarching goal of this study is to investigate Semantic Web technologies for federated data integration and querying using real clinical and genetic data from Mayo’s EHR and Biobank. Using open-source tooling and software, we developed a proof-of-concept system that allows representing patient clinical and genotype data stored in Mayo’s enterprise warehouse system and the MayoGC database, respectively, as RDF, and exposing it via a SPARQL endpoint for accessing and querying. We leveraged existing ontologies, such as the Translational Medicine Ontology, Ontology for Biomedical Investigations, and Gene Ontology, for mapping the MCLSS and MayoGC database schema to standardized semantic concepts and relationships. Our use case for doing a phenome-wide association scan for T2DM demonstrated the applicability of using such an approach for flexibly interlinking and querying multiple heterogeneous data sources in a robust and semantically unambiguous manner. We hypothesize that further development of such a system can immensely facilitate, and potentially accelerate scientific findings in clinical and translational research, including genomics and systems biology.
The ultimate challenge for any PheWAS study is data interpretation. While discovery of new genotype-phenotype associations in PheWAS is important, many of the findings may reflect inter-relationships existing among the phenotypes, sub-phenotypes, and endo-phenotypes. As observed in the study by Pendergrass et al., the “novel” results may exemplify pleiotropy. For instance, as discussed earlier, all four SNPs in Figure 4 that have been previously associated with diabetes and related traits also demonstrate a significant association with skin and tissue disorders, suggesting that genetic variation in these genes may affect both phenotypes. Since PheWAS is meant to generate hypotheses, and hence is exploratory by nature, further investigation with larger cohorts is required to validate such findings.
Limitations
There are several limitations in the proof-of-concept system developed as part of this study. First, while we demonstrated the applicability of the system via one use case for T2DM, a more robust and rigorous evaluation along several dimensions (e.g., performance, query response, precision and recall of query results) is required before it can be deployed within an enterprise environment. Note that since our use case for T2DM is based on federated querying of multiple SPARQL endpoints, the system performance and query responses are dependent on the behavior of the endpoints (e.g., the endpoints may experience latency or denial of service). Nevertheless, we plan to perform a thorough system evaluation after the integration of additional MCLSS sources (e.g., laboratory, clinical, and pathology reports) that contain large amounts of patient data. Second, we experimented with the recently published Translational Medicine Ontology (TMO) in this study for mapping MCLSS database schemas to standardized concepts and relationships. While TMO classes are mapped to more than 60 different standardized ontologies, including SNOMED CT and NCI Thesaurus, the scope and breadth of the current TMO release (Version 1.0) is quite narrow for our purpose. Consequently, along with the creation of new classes and relationships, we augmented TMO with the Prostate Cancer Ontology and the Relations Ontology. Since these extensions are not part of the official TMO release, our goal is to work closely with the TMO curators for content enhancement in future releases. Third, formulating complex SPARQL queries using existing SPARQL editors is cumbersome and error-prone. Our current implementation lacks a more intuitive and user-friendly tool that can assist a “non-Semantic Web savvy” user in the query building process. We plan to address this issue within the timeframe of the project.
Finally, while in this study we only considered T2DM as our use case, in future we plan to conduct a large-scale PheWAS with the Mayo Biobank, which currently has more than 22,000 subjects enrolled.
Future Work
In addition to addressing the limitations described above, there are several activities that we plan to pursue in the future. First, in this study, we performed simple mappings between the MCLSS and MayoGC database schemas and the classes and relationships in TMO (extended). A more rigorous approach would be to investigate reference information models, such as clinical archetypes25, that provide a mechanism to express data structures in a shared and interoperable way. Second, we will investigate existing Semantic Web query visualization platforms such as SPARQLMotion26 and TripleMap27 that provide more intuitive and interactive interfaces for SPARQL query formulation and execution. Finally, we will investigate approaches for distributed and federated inferencing over RDF data. Recent studies28 have demonstrated that even simple subsumption inferences require significant computing power when reasoning over massive RDF datasets. Since access to extremely high-performance computers is not widely available, we will investigate distributed storage and indexing techniques using Apache Hadoop29.
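As an illustration of what a simple subsumption inference involves, the sketch below computes the transitive closure of subclass relations in memory and uses it to find subjects whose asserted diagnosis falls under a broader class. It is a single-machine toy under stated assumptions, not the distributed approach we intend to investigate; the class names and subject identifiers are hypothetical.

```python
from collections import defaultdict


def superclasses(subclass_of, cls):
    """Return every ancestor of cls by following subclass edges transitively.

    subclass_of is an iterable of (child, parent) pairs.
    """
    parents = defaultdict(set)
    for child, parent in subclass_of:
        parents[child].add(parent)

    seen = set()
    stack = [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(types, subclass_of, target):
    """Return subjects whose asserted class is target or a subclass of it.

    types is an iterable of (subject, asserted_class) pairs.
    """
    return {
        subject
        for subject, cls in types
        if cls == target or target in superclasses(subclass_of, cls)
    }


# Hypothetical class hierarchy and type assertions.
edges = [
    ("Type2Diabetes", "DiabetesMellitus"),
    ("DiabetesMellitus", "EndocrineDisorder"),
]
types = [("subject1", "Type2Diabetes"), ("subject2", "Hypertension")]

print(instances_of(types, edges, "EndocrineDisorder"))
```

Even this small example recomputes ancestors for every assertion; over very large numbers of triples, the closure must be materialized or indexed, which is where distributed storage and indexing become relevant.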
Conclusion
This study demonstrates how Semantic Web technologies can be applied in conjunction with clinical data stored in EHRs to accurately identify subjects with specific diseases and phenotypes, and to perform a PheWAS by integrating and analyzing the genotype data with a range of phenotypes. Such an approach has the potential to greatly ease the tedious, cumbersome and error-prone manual integration and analysis of data for clinical and translational research, including genomics studies and clinical trials.
Acknowledgments
This research is supported in part by the Mayo Clinic Early Career Development Award (FP00058504), the eMERGE consortia (U01-HG-006379), the SHARPn project (90TR002), Mayo Clinic Genome-wide Association Study of Venous Thromboembolism (HG04735), Mayo Clinic SPORE in Pancreatic Cancer (P50CA102701), and Mayo Clinic Cancer Center (GERA Program). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to thank Paul Decker, Rachel Gullerud, and Robert Freimuth for their help with access and preliminary analysis of MayoGC data.
References
- 1. McCarty C, Chisholm R, Chute C, et al. The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics. 2011;4(1):13. doi: 10.1186/1755-8794-4-13.
- 2. Kullo I, Fan J, Pathak J, Savova G, Ali Z, Chute C. Leveraging Informatics for Genetic Studies: Use of the Electronic Medical Record to Enable a Genome-Wide Association Study of Peripheral Arterial Disease. JAMIA. 2010;17(5):568–574. doi: 10.1136/jamia.2010.004366.
- 3. Denny JC, Ritchie MD, Crawford DC, et al. Identification of Genomic Predictors of Atrioventricular Conduction / Clinical Perspective. Circulation. 2010;122(20):2016–2021. doi: 10.1161/CIRCULATIONAHA.110.948828.
- 4. eMERGE Library of Phenotyping Algorithms. http://www.gwas.net. Accessed August 8th, 2011.
- 5. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26(9):1205–1210. doi: 10.1093/bioinformatics/btq126.
- 6. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genetic Epidemiology. 2011;35(5):410–422. doi: 10.1002/gepi.20589.
- 7. Bizer C, Heath T, Berners-Lee T. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems. 2009;5(3):1–22.
- 8. Bielinski SJ, Chai HS, Pathak J, et al. Mayo Genome Consortia: A Genotype-Phenotype Resource for Genome-Wide Association Studies With an Application to the Analysis of Circulating Bilirubin Levels. Mayo Clinic Proceedings. 2011;86(7):606–614. doi: 10.4065/mcp.2011.0178.
- 9. Rathmann W, Giani G. Global Prevalence of Diabetes: Estimates for the Year 2000 and Projections for 2030. Diabetes Care. 2004;27(10):2568–2569. doi: 10.2337/diacare.27.10.2568.
- 10. Herder C, Roden M. Genetics of type 2 diabetes: pathophysiologic and clinical relevance. European Journal of Clinical Investigation. 2011;41(6):679–692. doi: 10.1111/j.1365-2362.2010.02454.x.
- 11. Scott LJ, Mohlke KL, Bonnycastle LL, et al. A Genome-Wide Association Study of Type 2 Diabetes in Finns Detects Multiple Susceptibility Variants. Science. 2007;316(5829):1341–1345. doi: 10.1126/science.1142382.
- 12. Sladek R, Rocheleau G, Rung J, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007;445(7130):881–885. doi: 10.1038/nature05616.
- 13. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of Genome-Wide Association Signals in UK Samples Reveals Risk Loci for Type 2 Diabetes. Science. 2007;316(5829):1336–1341. doi: 10.1126/science.1142364.
- 14. Tappolet J, Bernstein A. Applied Temporal RDF: Efficient Temporal Querying of RDF Data with SPARQL. 6th Annual European Semantic Web Conference; Lecture Notes in Computer Science (LNCS), vol 5554; 2009. pp. 308–322.
- 15. Resource Description Framework (RDF) 2011. http://www.w3.org/RDF/. Accessed January 13, 2011.
- 16. Prud’hommeaux E, Seaborne A. SPARQL Query Language for RDF. 2008. 2010. http://www.w3.org/TR/rdf-sparql-query/. Accessed August 14th, 2010.
- 17. Pathak J, Kiefer R, Chute C. Applying Linked Data Principles to Represent Patient’s Electronic Health Records at Mayo Clinic: A Case Report. 2nd ACM SIGHIT International Health Informatics Symposium; 2012.
- 18. Chute C, Beck S, Fisk T, Mohr D. The Enterprise Data Trust at Mayo Clinic: A Semantically Integrated Warehouse of Biomedical Data. Journal of the American Medical Informatics Association. 2010;17(2):131–135. doi: 10.1136/jamia.2009.002691.
- 19. Virtuoso Universal Server 2011. http://virtuoso.openlinksw.com/. Accessed January 16, 2011.
- 20. Kho AN, Hayes MG, Rasmussen-Torvik L, et al. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. Journal of the American Medical Informatics Association. 2011. doi: 10.1136/amiajnl-2011-000439.
- 21. AHRQ Clinical Classifications Software (CCS) for ICD-9-CM Fact Sheet. http://www.hcupus.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp. Accessed March 11th, 2012.
- 22. Warodomwichit D, Arnett DK, Kabagambe EK, et al. Polyunsaturated Fatty Acids Modulate the Effect of TCF7L2 Gene Variants on Postprandial Lipemia. The Journal of Nutrition. 2009;139(3):439–446. doi: 10.3945/jn.108.096461.
- 23. Burwinkel B, Shanmugam K, Hemminki K, et al. Transcription factor 7-like 2 (TCF7L2) variant is associated with familial breast cancer risk: a case-control study. BMC Cancer. 2006;6(1):268. doi: 10.1186/1471-2407-6-268.
- 24. Agalliu I, Suuriniemi M, Prokunina-Olsson L, et al. Evaluation of a variant in the transcription factor 7-like 2 (TCF7L2) gene and prostate cancer risk in a population-based study. The Prostate. 2008;68(7):740–747. doi: 10.1002/pros.20732.
- 25. Lezcano L, Sicilia M-A, Rodríguez-Solano C. Integrating reasoning and clinical archetypes using OWL ontologies and SWRL rules. Journal of Biomedical Informatics. 2011;44(2):343–353. doi: 10.1016/j.jbi.2010.11.005.
- 26. Waldman S. TopQuadrant: SPARQLMotion Visual Scripting Language. http://www.topquadrant.com/products/SPARQLMotion.html. Accessed June 28th, 2011.
- 27. Triple Map. 2011. http://www.triplemap.com/. Accessed August 19th, 2011.
- 28. Goodman E, Jimenez E, Mizell D, al-Saffar S, Adolf B, Haglin D. High-Performance Computing Applied to Semantic Databases. Extended Semantic Web Conference; Athens, Greece. 2011. pp. 31–45.
- 29. Apache Hadoop Project. http://hadoop.apache.org/. Accessed July 15, 2011.




