AMIA Summits on Translational Science Proceedings. 2020 May 30;2020:326–334.

Standardized Architecture for a Mega-Biobank Phenomic Library: The Million Veteran Program (MVP)

Kathryn E Knight 1,^, Jacqueline Honerlaw 2,^, Ioana Danciu 1,^, Franciel Linares 1, Yuk-Lam Ho 2, David R Gagnon 2,3, Everett Rush 1, J Michael Gaziano 2,4, Edmon Begoli 1,*, Kelly Cho 2,4,*; On Behalf of the VA Million Veteran Program
PMCID: PMC7233040  PMID: 32477652

Abstract

Electronic health records (EHRs) provide a wealth of data for phenotype development in population health studies, and researchers invest considerable time to curate data elements and validate disease definitions. The ability to reproduce well-defined phenotypes increases data quality and comparability of results, and expedites research. In this paper, we present a standardized approach to organize and capture phenotype definitions, resulting in the creation of an open, online repository of phenotypes. This resource captures phenotype development, provenance, and process from the Million Veteran Program, a national mega-biobank embedded in the Veterans Health Administration (VHA). To ensure that the repository is searchable, extendable, and sustainable, it is necessary to develop both a proper digital catalog architecture and an underlying metadata infrastructure to enable effective management of the data fields required to define each phenotype. Our methods provide a resource for VHA investigators and a roadmap for researchers interested in standardizing their phenotype definitions to increase portability.

Introduction

Electronic health records (EHRs) provide a plethora of information and offer unique opportunities for population-based research, such as advanced methods development that leverages this data for phenotype development. However, a challenge remains: the lack of flexible and accessible platforms for storing, reusing, and sharing metadata that would leverage cross-cutting efforts and promote synergy between these initiatives. Moreover, researchers often spend valuable time and funding dollars scouring the literature to understand work already performed in their area of interest, or developing their own phenotypes. These phenotypes include diseases and observable traits. Systematic dissemination of definitions and methods is a critical step to accelerate phenomic science and further medical research. Previous efforts to develop libraries of phenotypes do exist [1] [2], but they lack standardization of definition criteria, advanced search capabilities, data visualization, and wide accessibility by the research community.

In this paper, we present the infrastructure supporting a phenomics library that ensures reproducibility of high-quality research products derived from Million Veteran Program (MVP) projects [3] using data from the Veterans Health Administration (VHA) EHR. This research is innovative in a few different ways. First, this phenotype catalogue maintains all data definitions in a standard metadata database, enabling search across all fields and visualization of the captured phenotype definitions. Second, it applies metadata standards wherever applicable and documents any local or custom data fields in a Metadata Application Profile (MAP) [4]. All relevant metadata files and documentation are bundled together in the CKAN [5] GitHub repository to aid both maintenance and reuse. Third, by standardizing the metadata fields, it facilitates automated ingestion of phenotypes from multiple sources, such as various research groups or literature mining.

Methods

Healthcare setting

Initial phenotypes were derived from the VHA, the largest integrated healthcare system in the United States, comprising over 170 medical centers [6]. The VHA EHR system provides unique opportunities for medical research. The electronic documentation in this system reflects the entire care continuum and includes data of great variety, from billing codes, medications, and laboratory tests to procedures and unstructured note data. The integrated nature of the healthcare system results in a dataset largely free of the loss-to-follow-up problem that plagues the biomedical research field [7]. Longitudinality is another important aspect of the Veterans Affairs (VA) dataset, which spans over 20 years. The phenomics library serves as a resource to MVP and VA researchers, but will be open to the entire biomedical community.

Million Veteran Program (MVP)

MVP is a mega-biobank cohort launched in 2011 to establish a national, representative, and longitudinal study of Veterans that combines data from survey instruments, electronic health records, genomics, and biospecimens. MVP is the largest ongoing mega-cohort biobank program in the US, with over 800,000 enrollees as of Fall 2019. Details on the design of MVP have been previously described [3]. The objective of MVP is to understand how genetic characteristics, behaviors, military exposure, and environmental factors affect health. Ultimately, by providing a framework for scientifically valid and clinically relevant precision medicine, MVP's goal is to enhance the care of the Veteran population and beyond. Currently, a series of MVP test projects have been producing early scientific work products [8] [9] and generating knowledge in both the phenomics and genomics domains. In this paper, we focus on the linkage of phenomics data based on the VA EHR in the context of phenomics library architecture development led by the MVP Data Core and the Department of Energy (DOE) Oak Ridge National Laboratory (ORNL) Core group through the MVP Computational Health Analytics for Medical Precision to Improve Outcomes Now (CHAMPION) [10] initiative.

Phenotypes

Our priority phenomics data domains populating this library are diseases, demographics, medications, and laboratory tests commonly used to identify a cohort or describe its characteristics. The phenotype definitions range from ICD-based classifications using simple rules to complex algorithms yielding a probability of disease. Mapped data elements, such as laboratory tests, and phenotypes are uniquely identified using phenotype_ref_id. This feature allows for multiple definitions of the same phenotype, letting users choose the algorithm that most appropriately fits the goals of their study.
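As a minimal sketch of this idea, the snippet below models two competing definitions of one phenotype, each carrying its own phenotype_ref_id (the only field name taken from the text; all other field names and values are hypothetical, not MVP's actual schema):

```python
# Illustrative sketch (not the actual MVP schema): two competing
# definitions of the same phenotype, distinguished by phenotype_ref_id.
definitions = [
    {
        "phenotype_ref_id": "T2D_001",   # hypothetical rule-based definition
        "phenotype_name": "Type 2 Diabetes",
        "method": "ICD rule",
        "criteria": ">=2 outpatient ICD-10 E11.* codes",
    },
    {
        "phenotype_ref_id": "T2D_002",   # hypothetical probabilistic definition
        "phenotype_name": "Type 2 Diabetes",
        "method": "probabilistic algorithm",
        "criteria": "model over codes, labs, and medications",
    },
]

def definitions_for(name, library):
    """Return every registered definition of a named phenotype."""
    return [d for d in library if d["phenotype_name"] == name]

matches = definitions_for("Type 2 Diabetes", definitions)
ref_ids = sorted(d["phenotype_ref_id"] for d in matches)
```

A user querying "Type 2 Diabetes" would see both entries and pick the reference ID whose algorithm suits their study design.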

Storage Infrastructure - CKAN

The project committee chose CKAN (Comprehensive Knowledge Archive Network) [11] as the library application framework. This open source framework is built with Python [12] (backend) and JavaScript [13] (frontend). It has already been implemented by several other government-led open data initiatives [14], and includes multiple features that met the core requirements set by the committee. While our phenotype catalogue does not exactly fit CKAN's primary intended use case as a data management tool, its core components include highly customizable user and extension interfaces, an extensible metadata management feature, a built-in search engine, and file storage functionality. Figure 1 shows CKAN's main components.

Figure 1. CKAN’s Main Architecture Features

Front end

CKAN’s front end is a Web Server Gateway Interface (WSGI) application built on Pylons [15] (soon to migrate to Flask [16]). Figure 2 shows a view of the user interface.

Figure 2. User interface

There are no set web server or deployment configurations. Constructed using Jinja2 HTML templates [17], the front end interacts with other CKAN components via either CKAN’s RESTful API [18] or the URL routing system. Thus, for any desired customization of the outward-facing catalog, a developer can override CKAN’s defaults through these mechanisms: a few edits to the configuration file can redirect CKAN to a customized set of template files, for example. For external users, the stock interface includes a limited faceted sidebar search (organization, tag, format, and license), as well as a single Google-like search box. For internal users, CKAN provides a means to add new datasets, including a simple metadata cataloging form populated by a few basic, fairly generic metadata fields. Datasets may be ordered by organization, group, or license.
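To make the API interaction concrete, the sketch below builds a request URL for CKAN's package_search action (a real CKAN Action API endpoint); the base host is a placeholder, not a deployed instance of this library:

```python
from urllib.parse import urlencode

def package_search_url(base_url, query, rows=10):
    """Build a CKAN Action API search request URL.

    CKAN exposes dataset search at /api/3/action/package_search;
    base_url here is a hypothetical host, not a real deployment.
    """
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"

url = package_search_url("https://phenotype-library.example.org", "diabetes")
```

Issuing an HTTP GET to such a URL returns a JSON envelope of matching dataset records, which is how an external system could harvest phenotype metadata programmatically.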

Phenotype metadata

CKAN’s customizable metadata management component was crucial to our project, as we wished to implement a formal, standardized metadata structure for phenotypes. Though the default metadata fields in CKAN are basic, these may be modified and extended either by designing a custom plugin (“ExtraFields” in Figure 1) or by using a combination of existing extensions created by members of the CKAN community. Our library makes use of the following community-created metadata plugins: Scheming [19], Repeating [20], and Composite [21]. Because metadata is managed in individual JSON files, these plugins allow for relatively straightforward maintenance, modification, and extension of our existing phenotype metadata schemas. In many situations, the actual R code or SQL queries are vital for systematic reproducibility of phenotypic results and cohorts. CKAN provides a means to associate the source code with metadata records by creating an attachment or URL link, and users with advanced privileges are permitted to access these programs stored in a separate GitLab repository.
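The JSON-file approach can be illustrated with a small, hypothetical schema fragment in the style of a ckanext-scheming dataset schema; apart from phenotype_ref_id, the field names are invented for illustration and do not reproduce MVP's actual MAP:

```python
import json

# Hypothetical fragment in the style of a ckanext-scheming dataset
# schema; field names (other than phenotype_ref_id) are illustrative.
phenotype_schema = {
    "dataset_type": "phenotype",
    "dataset_fields": [
        {"field_name": "phenotype_ref_id",
         "label": "Phenotype Reference ID", "required": True},
        {"field_name": "algorithm_description",
         "label": "Algorithm Description"},
        {"field_name": "source_code_url",
         "label": "Source Code (GitLab)"},
    ],
}

# Because the schema lives in plain JSON, extending it is a file edit:
phenotype_schema["dataset_fields"].append(
    {"field_name": "validation_population", "label": "Validation Population"}
)

serialized = json.dumps(phenotype_schema, indent=2)
```

Editing a declarative file of this shape, rather than application code, is what makes maintenance and extension "relatively straightforward" in practice.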

Metadata Schema

Figure 3 shows the metadata database schema.

Figure 3. Underlying database schema

The database structure follows a Reverse Star Schema (RSS) [22]. An RSS returns the intersection of dimensional data, as opposed to the traditional Star Schema, which shows dimensional data aggregated or filtered based on some criteria. This approach allows a phenotype to contain multiple measurements, drugs, and specimens (laboratory results). RSSs describe cause-and-effect relationships among indirectly related data concepts, and constitute an on-demand means to identify and refine patient cohorts based on potentially disconnected data points such as specimens, demographic data, active/remission status, conditions, treatments, and medications. The RSS makes this possible by providing variable granularity in phenotypes and extensibility of the data model.

The table and field naming follows the Observational Medical Outcomes Partnership (OMOP) naming convention [23] and adds a number of locally-defined MVP data fields. This lends interoperability with external, widely-used data models such as OMOP, while adding the much-needed modularity, granularity, and extensibility to accurately capture the nuances in MVP data. As with the Dublin Core model [24], our aim is to balance precision with the need for easy and efficient information exchange. The central table, Phenotype, contains uniquely identifying metadata for each phenotype: id, creator, maintainer, date created, performance measures, etc. The unifying key is the phenotype_ref_id. Each phenotype links to one or more entries in the other tables; this is necessary because a phenotype called “diabetes” can link to multiple diabetic medications, for example.

The fundamental difficulty of developing a MAP is that, to date, there is no established terminology to define what this actually entails. A MAP profiles data to meet a particular set of needs while maintaining an already established metadata standard to facilitate information exchange. Often, a MAP is expressed as a spreadsheet, but it corresponds to the underlying schema that defines the data fields in a library. The Dublin Core Metadata Initiative (DCMI) community has defined MAP guidelines within the context of the DCMI vocabulary [24]; however, no such guidelines exist for general profiles that make use of multiple controlled vocabularies. The W3C has recently begun drafting a Profiles Ontology [25], a document defining how entities are described as statements and corresponding values, and addressing the problem of constraints: clearly describing, in a machine-actionable way, which fields are expected, repeatable, mandatory, and so forth.

For the Phenotype Library to adhere as much as possible to proposed best practices and ongoing metadata initiatives like those described above, data fields were mapped to established controlled vocabularies. Specifically, we chose OMOP, Dublin Core, and several W3C vocabularies. For fields not found in any existing vocabularies, we created and documented local fields for domain, range, and cardinality.

Custom Ontology Extensions (COEs)

Flexibility is one of the main features of our phenotype library. Our infrastructure accommodates growth in size and complexity in three ways. First, as new data sources become available to phenotyping efforts, we include them by creating new tables in our metadata schema; an example would be the addition of a new type of omics data, such as proteomics or metabolomics. Second, for each data source we have an extensible data model structure that allows the addition of fields to accommodate new attributes. For example, in the future it might be important to record the type of facility collecting a specimen: inpatient, outpatient, emergency room, etc. In this situation, we would simply add a “facility” field to the Specimen table. Third, we account for flexibility in data mappings to ontologies. RxNorm [26], NDC [27], and FDB [28] are all possible ontologies for representing medications, and we allow the phenotype author to decide the best standardization for their use case. All additional fields will come from existing ontologies or controlled vocabularies; however, if a needed field does not currently exist in a published ontology, the field will be vetted by metadata specialists and added to the project’s metadata application profile.
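The second extension path above amounts to a single schema change. The sketch below shows the hypothetical "facility" attribute from the example being added to a Specimen table (column names beyond the new field are illustrative):

```python
import sqlite3

# Sketch of the extension path described above: adding the hypothetical
# "facility" attribute to a Specimen table without touching other tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE specimen (phenotype_ref_id TEXT, lab_test TEXT)")
con.execute("ALTER TABLE specimen ADD COLUMN facility TEXT")
con.execute("INSERT INTO specimen VALUES ('T2D_001', 'HbA1c', 'outpatient')")

# PRAGMA table_info reports the columns now present on the table.
cols = [row[1] for row in con.execute("PRAGMA table_info(specimen)")]
```

Because only the one table grows a column, existing phenotypes and queries that ignore the new attribute continue to work unchanged.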

Advanced search functionality

Metadata is indexed using SOLR [29], an open source search engine developed by the Apache Software Foundation. SOLR uses an inverted index data structure to support quick document retrieval (even with complex queries on large datasets) and can sort results by relevance. Metadata is persisted in a PostgreSQL [30] database. Code and query files attached to phenotypes are not internally searchable, but the phenotype definition, which contains the same content in human-readable form, is. Information retrieval is near-instantaneous, and results are displayed in a format similar to that of an online library catalog.
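A minimal illustration of the inverted-index idea that underpins SOLR-style retrieval: map each term to the set of documents containing it, so a multi-term query reduces to a set intersection. The documents and identifiers are invented for the sketch:

```python
from collections import defaultdict

# Toy corpus: two hypothetical phenotype definitions in plain text.
docs = {
    "pheno-1": "type 2 diabetes defined by icd codes and metformin",
    "pheno-2": "hypertension defined by icd codes and beta blockers",
}

# Inverted index: term -> set of document IDs containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Return documents containing ALL query terms (AND semantics)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

hits = search("icd", "metformin")
```

Looking up terms in a prebuilt index, rather than scanning every document per query, is what makes retrieval fast even on large collections; a production engine adds tokenization, ranking, and compression on top of this core structure.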

Security

To search the catalog and download files, a login is not required. However, access to the cataloging function to add new metadata and associated files does require a user account. Permissions are set and managed by system administrators, who may edit organization details, user details, and permanently delete records from the catalog. Records are “owned” by organizations. The roles and permissions of non-admin users may be scoped by organization; for example, certain users may only be given permission to create or edit records related to their own organization. For privacy and security purposes, all related code lives in a separate GitLab repository governed by its role-based access model. This library does not contain patient-level data.

Results

We implemented the database schema using several common use cases, such as demographics, medications, and laboratory tests. Figure 4 shows the metadata for phenotype examples in each of these domains.

Figure 4. Sample Mapping of MVP Phenotypes

We are extending existing vocabularies (OMOP, Dublin Core) to both retain granularity within the VA phenotypes and maintain compatibility with external data sources and systems. For instance, the VA phenotype for Beta Blockers required extensive documentation of generic medications found in the medication phenotype. While there is no direct mapping in the OMOP schema for this, we are able to map to OMOP’s drug_concept_id, domain_id, and concept_id fields, creating “hooks” that could potentially link our library metadata with outside data stores.
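As a hypothetical illustration of these "hooks," the record below carries local VA-style fields alongside the OMOP identifier fields named above. The concept ID values are placeholders, not real OMOP vocabulary entries, and the surrounding field names are invented for the sketch:

```python
# Hypothetical record combining local fields with OMOP "hook" fields.
# The OMOP concept IDs below are placeholders, not real vocabulary values.
va_medication = {
    "phenotype_ref_id": "BB_001",
    "local_name": "Beta Blockers (VA definition)",
    "generic_medications": ["metoprolol", "atenolol", "carvedilol"],
    # OMOP hooks enabling linkage to external OMOP-based data stores:
    "omop": {"drug_concept_id": 0, "domain_id": "Drug", "concept_id": 0},
}

def omop_hook(record):
    """Extract just the OMOP-mapped fields for cross-system linkage."""
    return record["omop"]

hook = omop_hook(va_medication)
```

An external OMOP-based system could join on the extracted hook fields while ignoring the VA-local detail, which is the compatibility the mapping is meant to buy.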

The custom fields used across the schemas include a description of the algorithm contents, the data elements employed, the source data used for phenotype development, and the population in which the algorithm was developed. New and existing classifications, including MeSH and OMOP, are used to categorize phenotypes and facilitate searching. Provenance of the metadata is captured using Dublin Core fields. Internally, any changes to the records are captured as administrative metadata (e.g., user name, field changes, and timestamps).

Phenotypes created by MVP authors are unique in that they may incorporate elements that use a data model such as OMOP or use a custom mapping. For example, VA researchers working with our Data Core team have performed manual mappings of certain laboratory tests and have identified inconsistencies with specific OMOP-mapped LOINC codes. Similar mappings have been performed for medications. The medication and laboratory schemas in the library allow for entry of these custom mappings or reference to existing OMOP fields.

Discussion

Since our approach uses templates for the phenotype data entry step, the result is structured data that can be stored and queried in a more rigorous fashion. This process also keeps the backend data compact, which translates to faster retrieval speeds. Searching across the phenotype library using different keywords, such as diseases, medications, and billing codes, takes advantage of the backend database indices.

The use of schemas for data capture ensures that metadata is recorded consistently across phenotypes and enables both internal comparison and potential interoperability with external systems. This approach standardizes phenotype definitions and facilitates additional phenotype cataloging, especially via automation (e.g., article scraping). Finally, standardized vocabularies and schemas enable extensibility, making it possible to apply ontologies to the schema for enhanced discoverability and visualization. Given that MVP has performed additional data mapping work, it is important to have the flexibility to use our own ontologies as needed.

Our approach creates citable phenotypes: not only do we include a data field that provides a standard citation format, but CKAN has integration capability with Datacite’s [31] API to provide a DOI for any newly registered record.

One potential drawback of our approach is the technical debt involved in implementing and maintaining an open source system. The future “cost” of maintaining CKAN and the related schema as these components age is something to consider: being open source, the system will require a dedicated system administrator and related technical professionals to perform maintenance, updates, and fixes if or when components malfunction due to outdated libraries. That said, because the data is persisted in PostgreSQL, it can be exported to other platforms or integrated with other discovery layers or information retrieval projects.

Conclusion

The linkage of large longitudinal VA EHR data with other biomarker and omics data is one of the strengths of the MVP mega-biobank. Such comprehensive data coverage and the scale of the large population in the VA and MVP provide unprecedented opportunities for new discoveries in both biomedical research and infrastructure development for scalable solutions. Developing an optimal data management structure (cataloging, storing, searching, sharing, and archiving) for the EHR-based phenomic library is a critical factor in expediting research toward translational science.

In addition, secondary use of healthcare data for research is increasingly common as more EHR systems are implemented as a result of HITECH [32]. Using a standardized phenotype library architecture facilitates sharing of phenotypes across the research community, both internally within the VA and externally with the medical research community. The proposed MVP COE architecture will be an ideal framework as the MVP phenomics library evolves in complexity and in the ever-increasing dimensionality of the data universe. This platform will also lay the groundwork for addressing challenges of interoperability across various EHRs.

Acknowledgments

We would like to thank everyone who has contributed to the implementations described in this paper, in particular: David Heise, Ben Taylor, Ben Mayer, Hope Cook, Lauren Costa and Jeffrey Gosian.

The views expressed in this article are those of the authors, and do not necessarily reflect the position or policy of the Department of Veterans Affairs. The authors thank the members of the Million Veteran Program Core, those who have contributed to the Million Veteran Program, and especially the Veteran participants for their generous contributions. The Million Veteran Program is funded by the Office of Research and Development, Department of Veterans Affairs, supported by grant MVP000.

This manuscript has been in part co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy, and under the MVP CHAMPION program between the Department of Veterans Affairs (VA), and the Department of Energy (DOE).

References

