AMIA Summits on Translational Science Proceedings. 2019 May 6;2019:370–378.

Extending i2b2 into a framework for semantic abstraction of EHR to facilitate rapid development and portability of Health IT applications

Kavishwar B Wagholikar 1,3,5, Layne Ainsworth 5, Vishal P Vernekar 4, Ameet Pathak 4, Corey Glynn 5, David Zelle 5, Akshay Zagade 4, Neelima Karipineni 1,2, Christopher D Herrick 5, Marian McPartlin, Tiffany V Bui, Mike Mendis 5, Jeffery Klann 1,3,5, Michael Oates 5, William Gordon 2, Christopher Cannon, Rahul Patel 4, Samuel J Aronson 5, Calum A MacRae 1,2, Benjamin M Scirica 1,2, Shawn N Murphy 1,3
PMCID: PMC6568124  PMID: 31258990

Abstract

The wide gap between a care provider’s conceptualization of the electronic health record (EHR) and the structures used for EHR data storage and transmission presents a multitude of obstacles for the development of innovative Health IT applications. While developers model the clinicians’ view of the EHR at one end, they work with a different data view to construct Health IT applications. Although there has been considerable progress in bridging this gap through the evolution of developer-friendly standards and tools for terminology mapping and data warehousing, there is a need for a simplified framework to facilitate development of interoperable applications. To this end, we propose a framework for creating a layer of semantic abstraction on the EHR and describe preliminary work on the implementation of this framework for management of hyperlipidemia and hypertension. Our goal is to facilitate the rapid development and portability of Health IT applications.

Introduction

There is a substantial gap between how care providers conceptualize the electronic health record (EHR) and how the data is stored in the EHR system. This presents a multitude of obstacles for the development of innovative Health IT solutions.

First, the data is generally stored in proprietary formats and requires special knowledge to convert to standard formats.1 The adoption of interoperability standards like the Fast Healthcare Interoperability Resources (FHIR) Application Programming Interface (API) holds the promise to resolve this issue.2,3 However, EHR vendors are slow to implement such APIs. Additionally, the FHIR standard only defines the syntax for sharing health data, and prescribes the use of standard terminologies for semantic interoperability.4 Consequently, the scope of the FHIR standard is limited to the syntactic transformation of EHR data.

Second, the EHR data is stored using proprietary coding systems, and an elaborate mapping, which may not be easily accessible to developers, is required to transform the data into a standard coding system.5,6 For instance, many EHR systems have their own coding system for medications, while RxNorm is a standard coding system for medications. A mapping from the local medication codes to RxNorm would be required to translate the medication data from that site to RxNorm.7
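A minimal sketch of such a local-to-standard mapping is given here, assuming a simple in-memory lookup table. The site codes and the standard codes are placeholders invented for illustration; they are not real RxNorm identifiers, and the function name `to_standard` is hypothetical.

```python
from typing import Optional

# Hypothetical many-to-one map from site-specific medication codes to a
# standard code. All codes are illustrative placeholders, not real RxNorm IDs.
LOCAL_TO_STANDARD = {
    "SITE_A:MED_00123": "RXNORM:EXAMPLE_1",
    "SITE_A:MED_00456": "RXNORM:EXAMPLE_1",  # several local codes can share one standard code
    "SITE_A:MED_00789": "RXNORM:EXAMPLE_2",
}


def to_standard(local_code: str) -> Optional[str]:
    """Return the standard code for a local code, or None if it is unmapped."""
    return LOCAL_TO_STANDARD.get(local_code)


print(to_standard("SITE_A:MED_00456"))
print(to_standard("SITE_A:UNKNOWN"))
```

Returning `None` for unmapped codes, rather than raising an error, lets a caller collect the unmapped codes for review by a data analyst.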

Third, care providers often combine multiple EHR elements to derive facts during clinical decision making.8 These derived variables range from simple calculations, such as deriving body mass index (BMI) from weight and height, to complex scores like the 10-year risk of developing atherosclerotic cardiovascular disease (ASCVD).9 Developers address this by implementing the derivation logic in their application layer, which increases the complexity of the application and hinders reuse of the logic across applications.10 Consequently, healthcare innovators spend disproportionate effort synthesizing the EHR data into the readily useful concepts needed to drive their applications.11

To address these obstacles, we propose a framework for creating a layer of semantic abstraction on the EHR. Furthermore, we describe our preliminary work on implementation of the framework and application to a use case of management of hyperlipidemia and hypertension. Our goal is to facilitate rapid development and portability of Health Information Technology (IT) solutions.

Methods

Our framework builds on the Integrating Biology and the Bedside (i2b2) open source clinical data analytics platform that is used at more than 150 health care institutions for querying patient data.12,13 The platform is composed of several i2b2 cells that provide different services, and the cells communicate with each other using XML web services.

The methodology for creating the semantic abstraction is described in Figures 1 and 2.

  1. The first step is for the knowledge engineer to work with the clinical team to model the information needed at the point-of-care. The engineer uses a canonical form to provide semantic context for the concepts. For example, the concepts of the Low-Density-Lipoprotein (LDL) test and Diabetes Mellitus are represented in Table 1. The concept hierarchy serves as a central place for team members to unambiguously find the concepts for the project.

  2. Next, the concept hierarchy is imported into i2b2 concept dimension and metadata table as shown in Table 2. Additional rows are needed in the concept dimension table for parent paths of the concepts.

  3. The site data is converted into a tabular form, such that each row has the elements needed to construct a fact in the i2b2 data model. Each row includes the 7 elements that form a composite primary key: a time-stamp, identifiers for patient, provider, encounter, concept and modifier, and an instance number. These correspond to the start_date, patient_num, provider_id, encounter_num, concept_cd, modifier_cd and instance_num columns in the fact table. The concept code is the site-specific identifier or proprietary code for the concept. The instance number allows facts with more than one modifier to be grouped together (facts with different modifier codes are related by sharing the same instance_num). For example, a medication order can be represented in the fact table as shown in Table 3. In addition to these columns, the table includes the numeric or text values and/or units associated with the concepts, which correspond to the nval_num, tval_char and units_cd fields in the i2b2 observation_fact table. For disparate sources or data types, multiple fact-table-like tables are created.

  4. Next, the fact tables are imported into the observation_fact table in the i2b2 database, with the table name appended as the source_name of the facts from each table (see Figure 2).

  5. Next, the site data analyst reviews the ‘observation-fact’-like tables along with site documentation to develop a many-to-one mapping from the site-specific codes to the standard codes for the concepts identified by the knowledge engineer.

  6. The local to standard code map is imported into the i2b2 concept dimension table by appending the site-codes to the standard codes at the leaf nodes in the concept ontology, essentially adding an additional level to the ontology hierarchy. Effectively, the ontology now embeds the terminology mappings.

  7. Derived variables are implemented as SQL queries, or as compute-intensive processes written in a programming language. For example, the following SQL expression gives age from date of birth: age int AS (SELECT DATEDIFF(DAY, date_of_birth, GetDate()) / 365.25)

  8. The derived variables are asynchronously computed and imported into the fact table.

  9. The i2b2 webclient API can now be used to search for patients, or the i2b2 getPDO (get Patient Data Object) webservice API can be used to retrieve data for a patient.
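The age derivation in step 7 and the import of the derived value in step 8 can be sketched outside the database as follows. This is an illustration under stated assumptions, not the framework's implementation: the function names, the `derived:age` concept code and the reduced set of fact columns are hypothetical.

```python
from datetime import date
from typing import Dict


def age_in_years(date_of_birth: date, today: date) -> int:
    """Age as whole years, mirroring DATEDIFF(DAY, ...) / 365.25 truncated to int."""
    days = (today - date_of_birth).days
    return int(days / 365.25)


def derived_age_fact(patient_num: int, date_of_birth: date, today: date) -> Dict[str, object]:
    """Build a fact-table-like row for the derived age (hypothetical concept code)."""
    return {
        "patient_num": patient_num,
        "concept_cd": "derived:age",  # assumed code, not taken from the paper
        "start_date": today.isoformat(),
        "valtype_cd": "N",
        "nval_num": age_in_years(date_of_birth, today),
        "units_cd": "years",
    }


print(derived_age_fact(11612123, date(1970, 1, 1), date(2018, 2, 15)))
```

Passing `today` explicitly instead of reading the system clock keeps the derivation reproducible when facts are recomputed asynchronously.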

Figure 1:

Methodology for implementing the framework

Figure 2.

Framework construction proceeds from left to right, beginning with definition of clinical variables needed at the point-of-care, and decomposing them into granular concepts that can be expressed in the form of codes using standard coding systems. For instance, the concept of Body Mass Index (BMI) is derived from weight and height concepts that are explicitly recorded in the EHR. The layer of concepts represented with standard codes is connected to a site-specific translation layer that includes the local/proprietary to standard code mappings and the logic to assemble data into meaningful facts. Although the framework is implemented from left to right, the data flows from right to left as shown in the figure.

Table 1.

Concept definitions by the knowledge engineer

| Standard-Code | Path | Description |
|---|---|---|
| LOINC:57698-3 | /Labs/Blood/Chemistry/Lipid_Profile/LDL | Lipid Panel with Direct LDL |
| ICD-10:E10 | /Diagnosis/Endocrine_nutritional_and_metabolic_diseases/Diabetes_mellitus/Type1 | Type 1 diabetes mellitus |

Table 2.

Rows in the concept dimension table for creating hierarchy for LDL.

| c_hlevel | c_fullname | c_name | c_synonym_cd | c_visualattributes | c_facttablecolumn | c_tablename | c_columnname | c_columndata | c_operator | c_dimcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \Labs\ | Labs | N | CA | concept_cd | concept_dimension | concept_path | T | LIKE | \Labs\ |
| 1 | \Labs\Blood\ | Blood | N | FA | concept_cd | concept_dimension | concept_path | T | LIKE | \Labs\Blood\ |
| 2 | \Labs\Blood\LDL\ | LDL | N | FA | concept_cd | concept_dimension | concept_path | T | LIKE | \Labs\Blood\LDL\ |
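The expansion of a single concept path into one row per hierarchy level, as in Table 2, can be sketched as below. The function name `hierarchy_rows` is hypothetical, only a subset of the columns is generated, and the choice of CA for the root and FA for deeper levels simply mirrors the rows of Table 2.

```python
from typing import Dict, List


def hierarchy_rows(path: str) -> List[Dict[str, object]]:
    """Build one metadata row per level of a slash-separated concept path."""
    parts = [p for p in path.split("/") if p]
    rows = []
    for level, name in enumerate(parts):
        # Backslash-delimited full name with a trailing backslash, as in Table 2.
        fullname = "\\" + "\\".join(parts[: level + 1]) + "\\"
        rows.append({
            "c_hlevel": level,
            "c_fullname": fullname,
            "c_name": name,
            # Mirrors Table 2: the root is marked CA and deeper levels FA.
            "c_visualattributes": "CA" if level == 0 else "FA",
            "c_dimcode": fullname,
        })
    return rows


for row in hierarchy_rows("/Labs/Blood/LDL"):
    print(row["c_hlevel"], row["c_fullname"], row["c_name"])
```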

Table 3.

Medication information consists of name, dose and route. Rows are transposed to columns for readability.

| Column | Fact 1 | Fact 2 | Fact 3 |
|---|---|---|---|
| encounter_num | 100 | 100 | 100 |
| patient_num | 11612123 | 11612123 | 11612123 |
| concept_cd | meds:antihtn | meds:antihtn | meds:antihtn |
| provider_id | 1000403 | 1000403 | 1000403 |
| start_date | 02-15-2018 00:00:00 | 02-15-2018 00:00:00 | 02-15-2018 00:00:00 |
| modifier_cd | MED:DOSE | MED:FREQ | MED:ROUTE |
| instance_num | 1 | 1 | 1 |
| valtype_cd | N | T | T |
| tval_char | | Daily | Oral |
| nval_num | 40 | 0 | 0 |
| units_cd | mg | | |
| end_date | 02-25-2018 00:00:00 | 02-25-2018 00:00:00 | 02-25-2018 00:00:00 |
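To illustrate how a shared instance_num ties the modifier rows of Table 3 back into one medication order, the following sketch groups facts by patient, encounter, concept and instance number. The grouping key and the function name `assemble_orders` are assumptions made for illustration, and only the columns needed for the grouping are included.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# The three fact rows of Table 3 (one medication order, three modifiers).
FACTS = [
    {"encounter_num": 100, "patient_num": 11612123, "concept_cd": "meds:antihtn",
     "modifier_cd": "MED:DOSE", "instance_num": 1, "valtype_cd": "N",
     "tval_char": "", "nval_num": 40, "units_cd": "mg"},
    {"encounter_num": 100, "patient_num": 11612123, "concept_cd": "meds:antihtn",
     "modifier_cd": "MED:FREQ", "instance_num": 1, "valtype_cd": "T",
     "tval_char": "Daily", "nval_num": 0, "units_cd": ""},
    {"encounter_num": 100, "patient_num": 11612123, "concept_cd": "meds:antihtn",
     "modifier_cd": "MED:ROUTE", "instance_num": 1, "valtype_cd": "T",
     "tval_char": "Oral", "nval_num": 0, "units_cd": ""},
]


def assemble_orders(facts: List[dict]) -> Dict[Tuple, Dict[str, object]]:
    """Group modifier rows that share patient, encounter, concept and instance_num."""
    orders = defaultdict(dict)
    for f in facts:
        key = (f["patient_num"], f["encounter_num"], f["concept_cd"], f["instance_num"])
        # valtype_cd selects which value column carries the data for this row.
        value = f["nval_num"] if f["valtype_cd"] == "N" else f["tval_char"]
        orders[key][f["modifier_cd"]] = value
    return dict(orders)


print(assemble_orders(FACTS))
```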

Implementation

We have partially implemented the proposed framework, including the process of importing data from relational databases and the execution logic to derive variables that can be modelled as a SQL statement. The import module is implemented as a Java process that ingests comma-separated value (CSV) files containing the concept definitions and hierarchy to create the concept dimension and metadata (ontology) tables in i2b2. The module then monitors an FTP site for relational tables dumped in CSV format and, based on the schema, imports the data into the i2b2 observation_fact table and the provider and encounter dimension tables. The i2b2 web services and webclient are used to retrieve data needed at the point-of-care. Specifically, the webclient can be used to construct a cohort of patients for clinical intervention. We created the i2b2 installation using i2b2-docker images.13,14
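The CSV ingestion step can be sketched in a few lines. The actual import module is a Java process that watches an FTP site; the Python sketch below only illustrates parsing a fact-table-like CSV dump and tagging each row with the name of its source table, and the function name `load_facts` and the sample source name `labs` are hypothetical.

```python
import csv
import io
from typing import Dict, List

# A one-row CSV dump modelled on Table 4 (LDL lab result).
SAMPLE_CSV = (
    "encounter_num,patient_num,concept_cd,provider_id,start_date,"
    "valtype_cd,tval_char,nval_num,units_cd,end_date\n"
    "101,11612121,labs:blood:ldl,1000403,2018-01-28 00:00:00,"
    "N,,84,mg/dL,2018-01-28 00:00:00\n"
)


def load_facts(csv_text: str, source_name: str) -> List[Dict[str, str]]:
    """Parse a fact-table-like CSV dump and tag each row with its source table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    facts = []
    for row in reader:
        fact = dict(row)
        fact["source_name"] = source_name  # name of the table the row came from
        facts.append(fact)
    return facts


facts = load_facts(SAMPLE_CSV, "labs")
print(facts[0]["concept_cd"], facts[0]["nval_num"], facts[0]["source_name"])
```

Values are left as strings here; a real loader would convert them according to the target column types before inserting into the database.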

Case study

Treating elevated cholesterol and blood pressure is known to reduce atherosclerotic cardiovascular disease (ASCVD) risk. With this goal, Brigham and Women’s Hospital (BWH) has developed a remote lipid and hypertension management program designed to provide patient education and support, effective LDL and blood pressure lowering treatment, and advanced lipid specialist management to a wide population of high cardiovascular risk patients through a navigator-driven program. This program entails manual screening of at-risk patients from the patient population, which can be facilitated by an application that uses EHR data. We applied the above framework to create a semantic layer for developing an application to screen patients for the program.

The knowledge engineer from the study team collaborated with the clinical experts to define the concepts needed at the point-of-care in CSV files, as described in the Methods section. The framework was used to automatically create the i2b2 ontology from these files. A data analyst obtained extracts from the institutional data warehouse for a random sample of patients, in the ‘fact-table-like’ format. These were ingested into the i2b2 hive using the framework’s import module, which created rows in the fact tables with appropriate links to the i2b2 dimension tables. To test the framework, we imported data for 20 random patients and verified that the i2b2 interface could be used to successfully query these data.

Results

The study team was able to model 81 concepts needed at the point of care, as shown in Figure 3. For 41% of the concepts, a code could be unambiguously identified from the Unified Medical Language System (UMLS). Table 5 shows the wide variation in the number of local codes that map to each standard concept. Finally, the study team was able to successfully execute i2b2 queries using the i2b2 webclient for the random sample of patients.

Figure 3.

Clinical concept Hierarchy

Table 5.

Distribution of local code mappings for a sample of clinical concepts

| Standard Concept | Count of local concepts |
|---|---|
| diastolic blood pressure | 2 |
| systolic blood pressure | 6 |
| estimated glomerular filtration rate | 33 |
| Hemoglobin A1c | 159 |
| Antihypertensive Medications | 635 |
| Blood LDL | 693 |

Discussion

Our results demonstrate the feasibility of applying the proposed framework to a real-world use case of screening patients for management of hyperlipidemia and hypertension. Our proposed framework provides a clear separation between the syntactic and semantic parts of the data transformation. The semantic information is modelled in the ontology hierarchy, which acts as a single source of concept definitions. The semantic component is further decomposed into ‘biological/clinical logic’ and the logic required to translate standard concept codes to local concept codes. The syntactic component transforms input sources into the de-normalized star schema of i2b2.

Our framework builds on the i2b2 platform, which is currently installed at over 150 institutions across the United States, and extends it in three ways. First, while i2b2 is generally used for storing all data that is available in the EHR, our framework leverages the i2b2 platform to serve data limited to a particular application. Second, before this work, i2b2 did not provide built-in functionality to asynchronously perform compute-intensive operations that ‘derive’ new facts/observations from existing facts, which we have now implemented in our framework. Third, the i2b2 platform lacked tooling to assist with loading the type of data described in this paper, which we have implemented in the framework.

The framework includes syntactic translation, which deals with formatting the source data from organizations in accordance with a particular standard. Source data arrives through multiple channels of availability and transport, each with its own format envelope. The syntactic transformation deconstructs this source data so that it can be stored in a structure that i2b2 supports. Our framework provides a transformation tool and a mechanism to store each source data category as a different table. The ingestion functionality translates each of the input tables into key-value pairs that map to the concept and fact tables respectively, which allows as many attributes as needed to fully define a particular data entity.

Semantic translation involves mapping organization-specific codes to standard reusable codes that are predefined within the solution’s application layer. These standard codes are also referred to as concepts, and they are essentially key attributes corresponding to the clinical or biological variables needed to make point-of-care decisions. The underlying premise of this approach is that the standard, pre-built data structures (concept codes) within the semantic layer can be mapped to the diverse data of multiple organizations, so that differing local codes from different organizations still map to the same standard structure within the semantic layer. This mapping configuration helps avoid re-tooling and re-coding, and promotes reuse and easy configuration for adoption by multiple entities. This is the key promise of the approach, which has previously been shown to be effective in the deployment of SHRINE networks, wherein a network concept is translated to a site-specific concept by an XML mapping file.15

However, the semantic translation capabilities of our framework extend beyond terminology mappings to allow modeling of logic for derived variables. For example, the logic for computing body mass index (BMI) from height and weight can be stored as a SQL operation or as program logic that can be called asynchronously. This places the focus on the reusability of the underlying process and associated technology components, which, once created and added to the library of available concepts, can be reused for another application. A similar approach has been implemented in the Eureka tool.16
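As an illustration of such program logic, the sketch below derives a BMI fact from a weight and a height observation using the standard formula, weight in kilograms divided by the square of height in metres. The function name and the concept codes `vitals:weight_kg`, `vitals:height_cm` and `derived:bmi` are hypothetical, and the input is assumed to already be keyed by concept code.

```python
from typing import Dict


def derive_bmi_fact(patient_num: int, values: Dict[str, float]) -> Dict[str, object]:
    """Derive a BMI fact from weight (kg) and height (cm) observations."""
    weight_kg = values["vitals:weight_kg"]            # assumed concept code
    height_m = values["vitals:height_cm"] / 100.0     # assumed concept code
    bmi = weight_kg / (height_m ** 2)
    return {
        "patient_num": patient_num,
        "concept_cd": "derived:bmi",  # assumed code for the derived concept
        "valtype_cd": "N",
        "nval_num": round(bmi, 1),
        "units_cd": "kg/m2",
    }


print(derive_bmi_fact(11612123, {"vitals:weight_kg": 70.0, "vitals:height_cm": 175.0}))
```

Because the derivation reads and writes fact-like rows, the same function could in principle be reused by any application that needs BMI, which is the reuse argument made above.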

Provenance aspects of data quality require the capture of traceability and time-stamp for each source data. Our framework captures provenance information, thereby helping trace the origin and time of creation for each fact. The timestamping helps to resolve any conflicts and overriding factors in case of any duplicate data scenarios.
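One possible way to use such timestamps for duplicate resolution is sketched below, assuming a "most recent import wins" policy. This policy, the `import_time` field and the choice of key fields are assumptions for illustration; the paper does not specify how conflicts are resolved.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Fields assumed to identify a fact; two rows with the same key are duplicates.
KEY_FIELDS = ("patient_num", "encounter_num", "concept_cd",
              "modifier_cd", "instance_num", "start_date")


def keep_latest(facts: List[dict]) -> List[dict]:
    """Keep, for each fact key, the row with the most recent import_time."""
    latest: Dict[Tuple, dict] = {}
    for fact in facts:
        key = tuple(fact[f] for f in KEY_FIELDS)
        current = latest.get(key)
        if current is None or fact["import_time"] > current["import_time"]:
            latest[key] = fact
    return list(latest.values())


base = {"patient_num": 11612121, "encounter_num": 101, "concept_cd": "labs:blood:ldl",
        "modifier_cd": "@", "instance_num": 1, "start_date": "2018-01-28 00:00:00"}
facts = [
    dict(base, nval_num=84, source_name="labs", import_time=datetime(2018, 2, 1, 9, 0)),
    dict(base, nval_num=86, source_name="labs", import_time=datetime(2018, 2, 2, 9, 0)),
]
print(keep_latest(facts))
```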

A limitation of our work is that we have tested the framework only on a sample of 20 patients, and have only used relational tables from our own institutional data warehouse as the input source. We are currently using the framework to import a large sample of 100,000 patients, and plan to integrate it with Simple Object Access Protocol (SOAP) web services and Health Level Seven (HL7) version 2 (v2) message interfaces to synchronize the data with the EHR in real time. Another limitation is that we have only partially implemented the asynchronous mechanism to compute derived variables (as described in steps 7 and 8 in Figure 1), which we plan to complete in the near future.

Conclusion

We have described a framework for a semantic layer between the EHR and applications that facilitates a key step required for rapid development and portability of Health IT applications. Further, we have partially implemented the framework and demonstrated proof-of-concept of its application for screening patients for management of hyperlipidemia and hypertension. Future work includes completing the implementation of the framework by allowing import of data from non-relational sources, use of standards like Clinical Quality Language to model logic for derived variables,17 and a clinical pilot in the production setting.

Table 4.

Sample Rows for labs imported into fact table. The rows are transposed to columns here for readability.

| Column | Value |
|---|---|
| encounter_num | 101 |
| patient_num | 11612121 |
| concept_cd | labs:blood:ldl |
| provider_id | 1000403 |
| start_date | 2018-01-28 00:00:00 |
| valtype_cd | N |
| tval_char | |
| nval_num | 84 |
| units_cd | mg/dL |
| end_date | 2018-01-28 00:00:00 |

Acknowledgement

This work was supported by National Library of Medicine grant R00-LM011575, National Human Genome Research Institute grant R01-HG009174, Partners HealthCare and Persistent Systems.

References

  • 1. Wagholikar KB, Jain R, Oliveira E. Evolving Research Data Sharing Networks to Clinical App Sharing Networks. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science. 2017;2017:302–7.
  • 2. Wagholikar KB, Mandel JC, Klann JG. SMART-on-FHIR implemented over i2b2. Journal of the American Medical Informatics Association : JAMIA. 2017;24:398–402. doi: 10.1093/jamia/ocw079.
  • 3. Fast Healthcare Interoperability Resources: Draft Standards for Trial Use 2 HL7. 2015. https://www.hl7.org/fhir/2015May/index.html.
  • 4. Solbrig HR, Prud’hommeaux E, Grieve G. Modeling and validating HL7 FHIR profiles using semantic web Shape Expressions (ShEx). Journal of biomedical informatics. 2017;67:90–100. doi: 10.1016/j.jbi.2017.02.009.
  • 5. Pathak J, Solbrig HR, Buntrock JD, Johnson TM, Chute CG. LexGrid: A Framework for Representing, Storing, and Querying Biomedical Terminologies from Simple to Sublime. Journal of the American Medical Informatics Association. 2009;16:305–15. doi: 10.1197/jamia.M3006.
  • 6. Tao C, Pathak J, Solbrig HR, Wei W-Q, Chute CG. Terminology representation guidelines for biomedical ontologies in the semantic web notations. Journal of biomedical informatics. 2013;46:128–38. doi: 10.1016/j.jbi.2012.09.003.
  • 7. Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized names for clinical drugs: RxNorm at 6 years. Journal of the American Medical Informatics Association. 2011;18:441–8. doi: 10.1136/amiajnl-2011-000116.
  • 8. Rasmussen LV, Thompson WK, Pacheco JA. Design patterns for the development of electronic health record-driven phenotype extraction algorithms. Journal of biomedical informatics. 2014;51:280–6. doi: 10.1016/j.jbi.2014.06.007.
  • 9. Rana JS, Tabada GH, Solomon MD. Accuracy of the Atherosclerotic Cardiovascular Risk Equation in a Large Contemporary, Multiethnic Population. Journal of the American College of Cardiology. 2016;67:2118–30. doi: 10.1016/j.jacc.2016.02.055.
  • 10. Mo H, Thompson WK, Rasmussen LV. Desiderata for computable representations of electronic health records-driven phenotype algorithms. Journal of the American Medical Informatics Association. 2015;22:1220–30. doi: 10.1093/jamia/ocv112.
  • 11. Jason. The MITRE Corporation 0500; Artificial Intelligence for Health and Health Care.
  • 12. Murphy SN, Weber G, Mendis M. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). Journal of the American Medical Informatics Association : JAMIA. 2010;17:124–30. doi: 10.1136/jamia.2009.000893.
  • 13. Wagholikar KB, Dessai P, Sanz J, Mendis ME, Bell DS, Murphy SN. Implementation of informatics for integrating biology and the bedside (i2b2) platform as Docker containers. BMC Med Inform Decis Mak. 2018;18:66. doi: 10.1186/s12911-018-0646-2.
  • 14. Wagholikar KB, Mendis M, Dessai P. Automating Installation of the Integrating Biology and the Bedside (i2b2) Platform. Biomedical informatics insights. 2018;10:1420335253. doi: 10.1177/1178222618777749.
  • 15. Weber GM, Murphy SN, McMurry AJ. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. Journal of the American Medical Informatics Association : JAMIA. 2009;16:624–30. doi: 10.1197/jamia.M3191.
  • 16. Post AR, Krc T, Rathod H. Semantic ETL into i2b2 with Eureka! AMIA Joint Summits on Translational Science proceedings AMIA Joint Summits on Translational Science. 2013;2013:203–7.
  • 17. HL7 Cross-Paradigm Specification: Clinical Quality Language, Release. Vol. 1. Health Level Seven; 2018.
