Abstract
Translational research data are generated in multiple research domains, from the bedside to experimental laboratories. These data are typically stored in heterogeneous databases, held by segregated research domains, and described with inconsistent terminologies. Such inconsistency and fragmentation of data significantly impede the efficiency of tracking and analyzing human-centered records. To address this problem, we have developed a data repository and management system named TraM (http://tram.uchicago.edu), based on a domain ontology integrated entity relationship model. The TraM system has the flexibility to recruit dynamically evolving domain concepts and the ability to support data integration for a broad range of translational research. The web-based application interfaces of TraM allow curators to improve data quality and provide robust and user-friendly cross-domain query functions. At its current stage, TraM relies on a semi-automated mechanism to standardize and restructure source data for integration and thus does not support real-time data applications.
Keywords: data integration, data curation, data integrity, data continuity, translational research
1 Introduction
With completion of the human genome project, scientists are systematically studying the molecular basis of human diseases [1–3] to explore effective individualized therapies [4–7]. To achieve this unprecedented goal, investigators are breaking traditional boundaries between research domains, from patient bedsides to experimental laboratories, to conduct translational research [7–8]. Data generated from research on different topics need to be extensively reviewed and iteratively verified to become reliable clinical or scientific knowledge [9–10]. However, because the majority of clinical and basic research data are currently stored in disparate and separate domain databases, it is often inefficient for a researcher to access these data [11–14]. Furthermore, even where domain data can be aggregated and viewed through a single computational platform, translational researchers still often see incomplete, fragmented, and unverified data in their original forms. These problems greatly impede research efficiency, particularly statistical analysis. Despite overwhelming demand for a modern method to facilitate personalized data tracking, management, and improvement over a translational workflow, few software products that meet these requirements are available or widely accepted in the translational research community.
Our goal was to provide a computational system that is able to: 1) integrate data generated from multiple research domains with the flexibility to capture dynamically evolving domain concepts; 2) allow curation for data improvement; 3) support robust and intuitive query functions for biomedical researchers; 4) execute independently of third-party products, meaning the system does not have to rely on direct interaction with source databases (SDBs) or any middleware for its stable performance; and 5) be generic enough to be applied to a broad range of translational research. Achieving these goals enables our system to answer important questions that involve data generated in multiple research domains. For example, a translational researcher may ask: 1) How many patients who were diagnosed with cancer “A” and had pathology records available share a genetic profile “B”? 2) Which patients with a specific histological cancer type “C” who are under a specific treatment “D” share a distinct biomarker “E” and a unique family and exposure history? 3) Do these patients have tissue or DNA samples available, and where can these samples be obtained for further studies?
Our system, the Translational Data Mart (TraM), was developed upon a domain ontology (DO) [15] integrated entity relationship model (ERM) [16,17]; it has been implemented and is in use by several translational researchers. Later, in the Case Study section of this manuscript, we describe how the TraM system is applied in the real world for biomedical data integration and what the TraM data can offer to answer important research questions.
2 Terminology used in this paper
Domain data integrity means the data are “whole” or “complete” according to the required information standards set by a particular research domain. For example, microarray data must meet the Minimum Information About a Microarray Experiment (MIAME) standards defined by the functional genomics research domain [18].
Translational integrity means that data completeness meets the minimum standards defined by a translational research plan, which may include data from multiple research domains. Domain data integrity does not automatically yield translational integrity.
Translational continuity refers to a special data completeness status that allows one to track a single person’s data from one research domain to another across a translational workflow.
A data element (DE) is an atomic element within a database. It is equivalent to an attribute in an ERM [16]. A DE is composed of two functional domains: a concept domain, which holds the abstract name for a set of data sharing the same concept, and a value domain, which carries the records belonging to this concept. For example, “dosage” is a concept and “15” is a value; “unit of measure” is a concept and “mg/day” is a value.
Translational element (TE) denotes primary identifiers of related domain databases that are mapped to each other and stored within the databases. For example, when the barcode of a tissue sample (originating from a tissue bank database) is mapped to the medical record number (collected from a clinical database) of the person from whom the sample is derived, we say that the medical record number and barcode are TEs of each other. TEs are the DEs that assure translational integrity and continuity. If missing, TEs can be recovered by using other critical DEs stored in both SDBs, such as name, date of birth, race, and gender, as sketched below.
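For illustration only, TE recovery can be pictured as a deterministic join over the shared critical DEs. The following SQL is a minimal sketch under assumed staging tables and column names (`clinical_stage`, `tissue_stage`, and their columns are hypothetical, not TraM’s schema); in practice, matching also involves fuzzier logic and curator review:

```sql
-- Hypothetical sketch of TE recovery: re-link a tissue barcode to a
-- medical record number by matching critical DEs present in both sources.
SELECT c.medical_record_no,
       t.barcode
FROM   clinical_stage c
JOIN   tissue_stage   t
       ON  UPPER(c.last_name)  = UPPER(t.last_name)
       AND UPPER(c.first_name) = UPPER(t.first_name)
       AND c.date_of_birth     = t.date_of_birth
       AND c.gender            = t.gender;
```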
Data aggregation vs. data integration: data aggregation is the collective display of data on a unified platform, or the physical collection of data from separate sources within a centralized storage system. Aggregated data may or may not relate to each other. Data integration is a special type of data aggregation that requires that the aggregated data share TEs.
Personalized data are the data that can be identified as being associated with a distinct person, no matter how distant the data origin or derivatives are.
3 Background
3.1 Translational data status and domain database systems
The challenge of integrating source data from various research domains comes from the nature of translational workflows and the conditions of domain databases. In reality, one domain may contain zero, one, or more databases. Different databases designed for the same purpose may have distinct data structures. A database may have multiple versions, and each version often results in a set of data that does not share the same data structure with the others. The heterogeneity in concept extraction, data modeling, logical interpretation, naming convention, DE configuration, vocabulary used, and format definition all contribute to the challenge of data integration [19,20]. In addition, if SDBs are not designed to store TEs from other domain databases, the connections among these source data will be disrupted, even though domain data integrity within these SDBs might have been achieved. Furthermore, logically consecutive SDBs in a translational workflow often recruit biomedical records in an administratively autonomous manner. If these databases recruit data from unrelated cohorts, personalized data flow can be truncated without notice [21]. These problems all lead to one unwanted consequence: data are inconsistent in their structure and expression and discontinuous in their cross-domain connections. Data in such a condition cannot be effectively comprehended and used without thorough cleansing, recovery, reconfiguration, and reorganization.
3.2 Data organization architectures for data integration
Several methods have been proposed to address the problems associated with integrating biomedical data. These methods include semantic mapping [22], ontology and agent methods [23], service-oriented architectures or grids [24–26], distributed search engines [27–29], and federated databases and data warehouses [19,20]. For the interoperable data sharing methods, e.g., service-oriented grids, distributed search engines, and federated databases, the availability of a service-enabled infrastructure is essential. This kind of infrastructure has not yet been established or standardized in most medical institutions. The majority of Health Insurance Portability and Accountability Act (HIPAA)-compliant SDBs are proprietary products, and many have neither native web services nor an accessible application programming interface (API), which makes an immediate interoperable data extraction plan infeasible. Even if many SDBs become service-enabled, which undoubtedly would greatly enhance the ability to aggregate data from disparate sources, translational integrity and continuity would not be automatically achieved simply because of improved interoperability. A thorough data cleansing and verification process is likely required before data can be truly integrated and effectively used [20,23]. Furthermore, it would take tremendous effort and time to make every SDB required by a translational research plan service-enabled; if one of these SDBs happens not to be interoperable, the data held within it must be extracted and integrated by other means. On the other hand, semantic mapping services have been developed to improve data standardization efficiency [22,26]. However, they alone may not be sufficient to resolve the deeper problems caused by the divergence of data modeling methods.
It is generally agreed that no single data integration architecture can satisfy all demands of the entire biomedical research community. For the goals we intend to achieve, in particular improving translational data integrity and continuity, data warehouses and federated databases are the most appealing [19,20]. The two approaches are based upon entirely different design theories and result in distinct system architectures, and each has its strengths and limitations. Table 1 (modified from Louie B., et al. [20]) compares the two architectures, noting issues specific to translational research. For both architectures, the challenge of achieving broad system adaptability across different SDB environments is daunting, although the coping methods differ. We believe that the heterogeneity of SDB architectures and the segregation of domain database management across institutions will have a larger impact on federated databases than on data warehouses. A data warehouse is a stand-alone system, and only access to source data is required for its basic function.
Table 1.
Data integration system comparison for translational data
| Architecture | Requirement | Advantages | Disadvantages | Applications |
|---|---|---|---|---|
| Data Warehouse | Source data | Excellent query performance; allow curation; support data cleansing | Limited data coverage; not real-time data; extra data copy; inconsistent data copy between SDB and warehouse | Streamlined data; high quality data; curation required; personalized translational continuity required; real-time data not required; global range detailed data not required |
| Federated DB | Transparent SDB and network architecture; accessibility to API of constituent SDBs | Updated or real-time data; flexible data coverage; no extra data copy | Little data cleansing; infrastructure dependency; performance may be affected by constituent SDBs; not easy to achieve the continuity of personalized data across research domains | View data as they exist; real-time data required; global range all data coverage preferred; “write” permission not required; personalized translational continuity not required |
3.3 Data integration methods
Data integration methods are classified into three subtypes [19,23]: (i) information linkage, (ii) query translation, and (iii) data translation. Information linkage uses a URL to access data in HTML form presented by other computational platforms through the Internet [23]. Query translation converts source data on the fly and presents the data via a virtual data organization structure [19]; it is usually part of a data federation solution. This approach does not require physically storing an extra copy of the data, so data always stay in their original form in the SDBs. Data translation is often associated with the data warehouse method. The end product of this approach is a physical copy of data that may not be presented or organized the same way as in the original storage systems [19,20]. This method, together with data warehouse architecture, seems somewhat under-appreciated in discussions of biomedical data integration technology [19,22,26]. However, both have been suggested to be more suitable for integrating clinical and human genetic profile data [20,23].
From our day-to-day experiences dealing with various translational data, we have found that preprocessing translational raw data requires considerable knowledge of research subjects in order to obtain reliable information. Therefore, in addition to automated data translation procedures, human intervention is required to assure data reliability. A data warehouse solution supports a unified platform for further data curation. Curating integrated data is not new in biomedical data management. Many public biomedical databases [30–35] have been subject to, and are continuously undergoing, expert-assisted curation.
4 Methods
After carefully comparing the advantages and disadvantages of the major system architectures for our intended goals (Table 1, section 3.2), we decided to use data warehouse architecture as our data storage method. To design a data warehouse that can sustain the progress of translational research, we must first dissect translational data in order to understand the fundamentals underlying their enormous complexity.
4.1 Anatomy of translational data
A typical translational data point can be placed in a three-dimensional space (Fig 1A). The first dimension (x-axis) comprises the material objects coming from human subjects. They are the research objects in a translational study. In this dimension, each derived object inherits all the characteristics of the upper-level object and passes on its own characteristics to its derivatives. The integrity of the object transitions in this dimension lays the foundation for translational continuity. The second dimension (y-axis) represents the concepts of scientific knowledge in various research domains. Each domain exists independently and is ruled by its internal logic. Associations between these domains occur only when they have conducted research on objects from the same individual. The third dimension (z-axis) is temporal, which is clearly important for tracking the status of any ongoing project. Therefore, a typical translational data point always contains three essential elements: a domain concept, a research object, and a time stamp. This model does not explicitly display the names and physical locations of research facilities, as they are emphasized in a clinical data model [36].
Fig. 1.
Translational data: A. Anatomic view of a data point; B. Global view of a logical data flow: the elliptical shapes indicate research objects and the paper shapes indicate research domains. The solid lines indicate inherited relationships between research objects; the broken lines, relationships between research objects and research domains.
A global view of dataflow (Fig 1B) illustrates the assembled translational data points described above. It further reveals the translational logic that defines the relationship between research domains and research objects, e.g., a mammogram is applied to a person while genotyping to a DNA sample. When the integrity of research objects is achieved, the possibility of translational continuity is established.
4.2 Data modeling
4.2.1 The conceptual data model
The rationale underlying the conceptual data model is the analysis of translational data anatomy described in Fig 1 (section 4.1). The backbone structure of the TraM data warehouse is an ERM [16] that extracts data entities from a translational workflow and constructs the relationships between these entities. A highly simplified ER diagram is outlined in Fig 2A: research object entities, corresponding to the x-axis of Fig 1A, are in “one-to-many” relations that cascade from a person object to the samples derived from this person. These objects and research domains (the y-axis of Fig 1A) are generally in “many-to-many” relations, presented as diamond shapes in Fig 2A. Each of these relationships corresponds to a set of three-dimensional data points summarized in Fig 1A and contains a time stamp (the z-axis of Fig 1A) as an attribute (e.g., a diagnosis date or a treatment date). There are no direct relationships between research domains unless they are associated with research objects from the same origin, as described in Fig 1B; a minimal SQL sketch of this backbone appears after Fig 2.
Fig. 2.
The TraM data model: A. The backbone structure of the ER diagram; B. An ontology prototype for a medical demographic questionnaire
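As an illustration of the backbone just described (our own sketch, not TraM’s actual DDL; all names are invented for this paper), the cascading research objects and one diamond relationship could look like:

```sql
-- Illustrative sketch of the backbone ERM: research objects cascade in
-- one-to-many relations, and a relationship ("diamond") table ties an
-- object to a domain concept with a time stamp (the z-axis of Fig 1A).
CREATE TABLE person   (person_id   NUMBER PRIMARY KEY);

CREATE TABLE specimen (specimen_id NUMBER PRIMARY KEY,
                       person_id   NUMBER NOT NULL REFERENCES person);

CREATE TABLE sample   (sample_id   NUMBER PRIMARY KEY,
                       specimen_id NUMBER NOT NULL REFERENCES specimen);

-- Many-to-many relationship between a research object and a research
-- domain; one row is one three-dimensional data point of Fig 1A.
CREATE TABLE diagnosis_event (
  person_id      NUMBER       NOT NULL REFERENCES person,
  diagnosis_code VARCHAR2(16) NOT NULL,  -- a domain concept, e.g., an ICD code
  diagnosis_date DATE         NOT NULL,  -- the time stamp
  PRIMARY KEY (person_id, diagnosis_code, diagnosis_date)
);
```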
4.2.2 The logical data model
Often in data modeling, the logical design process of assigning attributes to entities reshapes the conceptual model, because it makes us rethink the correctness of the earlier conceptual design. An example of these recurrent activities is the integration of domain ontology (DO) [15] into the TraM ERM backbone structure (Fig 2). Ongoing scientific inquiry often produces new concepts and/or classes of new concepts dynamically. At this rudimentary stage, these concepts are usually not well classified. In order to efficiently recruit these concepts, and the data generated under them, we need a more flexible data structure that can capture concepts and data in controlled vocabulary without disturbing the database architecture. Domain ontology suits this demand well and is fully supported by ERM technology [17]. A DO structure supports concept classification and treats concepts as data, but a DO alone does not accommodate many-to-many relationships. Therefore, it is not suitable for connecting research objects (person, specimen, and sample) to research data. Furthermore, the hierarchy of a DO cannot connect to other DOs without a higher-order ontology [15,37]. The purpose of TraM is not to integrate DOs per se; its goal is to integrate domain data, which may belong to the concepts classified in a DO. The solution is to establish a many-to-many relationship between the leaf class of a DO and a research object, so that new domain concepts can be created and new domain data integrated instantaneously.
To illustrate how a DO can play an important role in a translational data integration system, consider medical demographic survey data, one of the least standardized and structured datasets in translational research. It is not uncommon to see the same survey concept (i.e., question) worded differently in several questionnaires and to have the data value (i.e., answer) to the same question expressed in a variety of ways. The number of survey questions on a survey subject varies from fewer than ten to hundreds. Survey subject matter changes as research interests shift, and no one can really be certain whether a new question will emerge or what the question will look like. Therefore, little database support exists for this fluctuation in data, and some authors suggest such data do not belong in the clinical conceptual data model [36]. In reality, many survey results remain on paper or in locally designed ACCESS databases. These databases often treat a survey question as an attribute (a column). Thus, adding, removing, or changing any question will cause a change in table structure, so we often see multiple versions of a database for the same purpose, each containing similar survey data in different organizations and descriptions. As a consequence, it is extremely difficult to align the survey results and integrate them with other domain data, despite the fact that they routinely need to be integrated with other clinical records for a translational research plan.
To resolve this problem, we propose an ontology structure to manage the data (Fig 2B). In this hierarchy, the super class (branch) defines question sets, such as lifestyle or medical history, which do not belong to any upper concept except the questionnaire itself. The subclass (category), which can be one or more layers, classifies a general concept for a set of real questions, such as dietary habit or history of hormone replacement therapy. The question item is the leaf class of this ontology. Each item contains a set of attributes for a real question, such as “what,” “when,” “how,” and “why.” Accordingly, each of these questions also has a set of properties that define an answer, such as data type (number or text), unit of measure (cup/day, pack/day, ug/ml), and predefined answer options. In this model, a new question and its properties are treated as a new record (a new row in the question item table). Thus, the overall data structure stays the same even when a new survey question is added to the system. Each answer selected during a survey was predefined in controlled vocabulary when the ontology was built. This answer is recorded in a relationship between a person entity and a question item entity, so the survey results are seamlessly integrated with other domain data. The sketch below makes this concrete.
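Continuing the illustrative schema from section 4.2.1 (again, the names are ours, not TraM’s), the questionnaire DO and the person-to-question relationship can be sketched as:

```sql
-- Illustrative sketch of the questionnaire DO (Fig 2B): a new question is
-- a new ROW in question_item, never a new column, so the table structure
-- is stable as the questionnaire evolves.
CREATE TABLE question_category (
  category_id NUMBER PRIMARY KEY,
  branch      VARCHAR2(100) NOT NULL,   -- super class, e.g., 'Lifestyle'
  name        VARCHAR2(100) NOT NULL    -- e.g., 'Dietary habit'
);

CREATE TABLE question_item (
  item_id       NUMBER PRIMARY KEY,
  category_id   NUMBER NOT NULL REFERENCES question_category,
  question_text VARCHAR2(400) NOT NULL,
  answer_type   VARCHAR2(10)  NOT NULL, -- 'NUMBER' or 'TEXT'
  unit          VARCHAR2(20)            -- e.g., 'cup/day', 'pack/day'
);

-- Survey results live in the many-to-many relationship between the person
-- entity and the leaf class of the DO.
CREATE TABLE survey_answer (
  person_id   NUMBER NOT NULL REFERENCES person,
  item_id     NUMBER NOT NULL REFERENCES question_item,
  answer      VARCHAR2(200),
  answer_date DATE NOT NULL,
  PRIMARY KEY (person_id, item_id, answer_date)
);
```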
4.2.3 The physical data model
The process of finalizing the physical design of the TraM model focuses on DE configurations (attribute names, data types, and formats) and constraint classifications. Instead of mirroring the DE configurations of the SDBs, we make decisions based upon our analysis of the nature of the source data, regardless of their original structures. The difference in DE configuration between the TraM approach and the SDB methods reflects the disagreement over data modeling at both the conceptual and logical levels. For example, the definition and configuration of DEs can be very different when the same data content is restructured from a one-to-many relationship to a many-to-many relationship [16].
4.3 Data integration workflow
Two kinds of data integration methods are used in TraM. The information linkage method (via URL) [23] is used to connect to data in the public domain for biological concept adoption and reference information, as we do not see the necessity to restructure or re-present these data.
The data translation method [19,23] is used to aggregate individuals’ data from private and domain SDBs into TraM. A typical data translation workflow contains procedures for data concept extraction, data model conversion, data element reconfiguration, semantic mapping, data matrix reorganization, and data standardization. The entire process is illustrated in Fig 3, in which a medical demographic survey dataset is used as an example to depict the conversion of data from a non-ontology structure into a DO structure and its further integration into the TraM schema.
Fig. 3.
Workflow of data translation
In this typical data translation workflow (Fig 3), human intervention usually occurs at an early stage of the dataflow. Examples include ontology content development (concept extraction and classification) and data verification (recovering missing data and adjudicating conflicting data). Reusable procedures, such as data model conversion (transforming a set of data from a one-to-many relation into a many-to-many relation), data matrix transposition (changing columns to rows and vice versa), data reformatting and sorting, and data deployment, are automated; one such step is sketched below. Some tasks, which also occur at an early stage of this dataflow, need interaction between humans and computation. Examples include tokenization and standardization of free-text fields (detailed in the Case Study section).
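For example, the transposition step can be expressed as plain SQL once a curator has mapped old columns to question items. This is a minimal sketch against the illustrative schema of section 4.2.2: `legacy_survey` and its columns are hypothetical, and the item IDs 101 and 102 stand for curator-mapped questions:

```sql
-- Sketch of an automated, reusable translation step: transposing a flat,
-- column-per-question survey table (hypothetical layout) into the
-- row-per-answer form required by the DO structure of Fig 2B.
INSERT INTO survey_answer (person_id, item_id, answer, answer_date)
SELECT person_id, 101, TO_CHAR(smokes_per_day), survey_date
FROM   legacy_survey                 -- 101 = mapped 'smoking' question item
UNION ALL
SELECT person_id, 102, TO_CHAR(cups_coffee_per_day), survey_date
FROM   legacy_survey;                -- 102 = mapped 'coffee' question item
```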
5 System
5.1 System components
The TraM system contains three major components (Fig 4): (i) a relational database supported by Oracle, (ii) a web-based application system supported by Tomcat, and (iii) a data translation toolkit developed with various technologies. Currently, the first two components are being used directly by early-adopter translational researchers, while the toolkit is prototyped and operated by informaticians.
Fig. 4.
System components and environment
5.2 TraM database
5.2.1 Streamlined data coverage
If all the data from each and every domain in biomedical research needed to be collected, a data integration project would become almost impossible. Moreover, collecting every detail of a patient’s medical records is usually unnecessary. Therefore, the TraM data coverage is designed to keep the scope of the data results-driven and the information highly condensed. For example, the entire body of specimen/tissue banking data is condensed into two entities: specimen (unprocessed material from a human body, e.g., blood, urine, and solid tissues) and sample (processed material from a specimen, e.g., DNA, RNA, paraffin-embedded tissue, and cell lines). Only the barcode of a sample and a few attributes (organ name and sample name) are required. No operational details about the preparation of samples are included in the TraM database, and the storage location of a sample is not required either. Required attributes in the TraM schema were defined through discussions between the TraM data model designer and biomedical research domain experts. Therefore, even though the TraM data cover extensive domain information, the scope of the data is slim and streamlined.
5.2.2 Data dependency control
To assure translational data continuity, we define a dependency rule for research objects (Fig 1), from a person to the samples derived from this person, using foreign-key-not-null constraints; a short illustration follows. A bio-sample without the required person information will be rejected by TraM. This rule ensures rigorous control over the integrity and continuity of translational data.
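In terms of the illustrative schema from section 4.2.1, the rule behaves as follows (the error codes are Oracle’s; the IDs are made up):

```sql
-- With person_id declared NOT NULL REFERENCES person, orphan research
-- objects cannot enter the warehouse:
INSERT INTO specimen (specimen_id, person_id)
VALUES (9001, NULL);     -- rejected: ORA-01400, cannot insert NULL

INSERT INTO specimen (specimen_id, person_id)
VALUES (9001, 424242);   -- rejected if person 424242 does not exist (ORA-02291)
```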
5.2.3 Identifying HIPAA compliant data
The ability to track medical records and laboratory results to a particular person adds great value to translational research [6,7,9]. The TraM system creates a static ID for each person once the uniqueness of the person is validated. This static ID functions as the primary public ID for a person and is mapped to the original patient ID in the “person” physical table of the TraM database. However, all personally identifiable records are filtered out (unlinked) when a materialized view is created under a different user name (an Oracle concept). Both the materialized view and the entire TraM schema sit behind a firewall, while the query application is restricted to interacting only with the materialized view. Thus, personalized data can be identified through this public ID, yet remain de-identified in compliance with HIPAA regulations [38] (sketched below). This method is relatively trivial in a data warehouse architecture, but can be challenging for a query translation method in an interoperable data integration system.
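A minimal sketch of the idea, assuming a separate `deid` schema and the illustrative tables from earlier sections (and assuming `person` also carries `pub_id` and demographic columns); TraM’s actual DDL may differ:

```sql
-- De-identification by materialized view: the view exposes the static
-- public ID and research data but omits every HIPAA identifier column.
CREATE MATERIALIZED VIEW deid.person_mv AS
SELECT p.pub_id,            -- static public ID
       p.gender,
       p.race,
       d.diagnosis_code,
       d.diagnosis_date
FROM   tram.person p
JOIN   tram.diagnosis_event d ON d.person_id = p.person_id;

-- tram.person retains the pub_id <-> medical record number mapping behind
-- the firewall; the query application may access deid.person_mv only.
```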
5.2.4 Terminology adoption and classification
The TraM system avoids using locally invented terminology as data descriptors. Where the public domain provides reputable domain ontologies or concept nomenclatures, the TraM system adopts these standards as valid terminology to describe the TraM data. TraM preloads the concept descriptions and codes of the International Classification of Diseases (ICD) [39] as primary disease descriptors, since this classification system is used in many clinical SDBs in the United States. As a research database, TraM also supports the NCI Thesaurus [35] and SNOMED-CT [40] nomenclatures and relies on the Unified Medical Language System (UMLS) [41] to map SNOMED-CT and ICD concepts [42–44]. For molecular biology knowledge, TraM relies on hyperlinks to locate up-to-date, detailed information in public knowledgebases through the Internet. Examples include UniProt [30], Entrez Gene [31], OMIM [34], and dbSNP [45].
For research areas where no reputable nomenclatures are available in the public domain, the TraM system provides predefined DO structures to assist researchers in creating their own concept ontologies. For example, TraM provides DO structures for medical demographic survey questionnaires and physical exam names.
5.2.5 One schema for multiple medical specialties
It is unwise to build a specific database schema for each different translational research project. Such an approach is not only costly for application development, but also troublesome for system adoption and maintenance. In reality, although research domains vary across translational research workflows, many of them overlap. Furthermore, there is a strikingly common logic of scientific conduct among a wide range of translational workflows, though a given research project may have domain usage preferences (e.g., diabetes research more often focuses on results from metabolic laboratory tests, while cancer research may focus on pathology reviews). Decoupling the unchanging data structure (e.g., the three-dimensional data point and translational business logic flow described in Fig 1) from frequently changing parameters (e.g., the actual research domains required by a particular research plan, and domain scientific concepts and their nomenclatures) reveals the feasibility of using one schema to support data from multiple medical specialties. In other words, the application range of TraM is determined by the translational logic flow, not by the research subjects.
5.3 TraM application system
Three types of application modules are designed in the application layer to meet the specific goals of our project, and each type has distinct architecture and logic control.
5.3.1 Account management module
As TraM is intended to support multiple topics of translational research and to allow curation for continuous data quality improvement, data privacy and security controls are necessary. The account management module is designed to control data accessibility for each project and to hide all HIPAA-protected information from regular users. These functions are implemented through a session control mechanism. Four types of user roles are defined under each account, and each type of role has different data accessibility (sketched in database terms after Fig 5). The account administrator is an end user who is responsible for assigning each of the other users a proper role based upon Institutional Review Board (IRB) protocols. A graphical user interface (GUI) for the account administrator was developed for this purpose (Fig. 5). We assume account administrators know their colleagues and collaborators better than a database administrator does. The regular user and the power user have only “read” permission to query the TraM data. The difference between them is that a power user (usually a physician) can view patient medical record numbers while a regular user cannot. The curator (usually a data manager or someone who has domain knowledge but does not directly conduct translational research) has both “read” and “write” permission and can see HIPAA-protected information. The users of one account do not have access to the data owned by other accounts unless permission is granted by those account administrators. In addition, the account management module maintains a user-ID-associated activity log to record data manipulation history.
Fig. 5.
Account administrator interface: the account administrator can assign a proper role to a TraM user.
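Expressed in database terms, the permission matrix resembles the grants below. TraM enforces these roles through its application-layer session control, so this is purely illustrative, with role and object names of our own (the wider MRN view is hypothetical):

```sql
-- Illustrative permission matrix for the four roles. The account
-- administrator's function (role assignment) lives in the application
-- layer rather than in database privileges.
CREATE ROLE tram_regular;
CREATE ROLE tram_power;
CREATE ROLE tram_curator;

GRANT SELECT ON deid.person_mv   TO tram_regular;  -- read-only, de-identified
GRANT SELECT ON deid.person_mv   TO tram_power;
GRANT SELECT ON tram.mrn_lookup  TO tram_power;    -- may also view medical record numbers
GRANT SELECT, INSERT, UPDATE ON tram.person        TO tram_curator;  -- read/write,
GRANT SELECT, INSERT, UPDATE ON tram.survey_answer TO tram_curator;  -- sees protected data
```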
5.3.2 Curation modules
A conventional data warehouse usually does not provide a curator GUI. For a translational data integration system, where most data come from heterogeneous sources of uneven quality, data curation is essential. The curation modules of the different domains share similar architectures and logical flows. The curator GUI supports “read” and “write” abilities to facilitate data improvement. With the underlying data model support, curators are able to create new concepts for a given domain when needed and use these concepts to curate data immediately. Fig 6 illustrates how a medical survey questionnaire is defined within a DO structure and how such a questionnaire is instantly used to provide concepts for the survey records. The underlying mechanism allows the new question to play a double role: first, as a record (value) in the questionnaire DO structure and, second, as a question concept for survey results (see the sketch after Fig 6). In this way, a curator gains enormous flexibility in recruiting new concepts within a predefined data structure and in annotating data with the controlled vocabularies of this DO.
Fig. 6.
Curator interface: A) A curator can create new concepts within a pre-defined questionnaire ontology. B) These questions and answer options can be used instantly to curate medical survey records.
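In terms of the illustrative schema of section 4.2.2, the double role amounts to two inserts (all values are made up):

```sql
-- 1) The curator adds a question: a new record in the DO leaf class.
INSERT INTO question_item (item_id, category_id, question_text, answer_type, unit)
VALUES (103, 7, 'Years of hormone replacement therapy', 'NUMBER', 'years');

-- 2) The same row immediately serves as the concept for incoming answers.
INSERT INTO survey_answer (person_id, item_id, answer, answer_date)
VALUES (55, 103, '4', DATE '2007-03-15');
```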
5.3.3 Query modules
The query modules are developed across research domains based on materialized views and text index methods. De-identification is implemented in these modules for HIPAA compliance. To make expressing queries intuitive for biomedical researchers, a query-by-example (QBE) style GUI (Fig 7) was used [46,47]. The interface allows users to interactively select query filters; decide “and,” “or,” and “not” conditions for each filter; execute query commands dynamically; and determine which data fields are displayed in a query return. Thus, one can query the TraM data from any point along an entire translational dataflow and receive query returns bi-directionally between research domains, from patient bedsides to experimental benches. Personalized data can be tracked historically (retrospective and prospective data) and translationally (across research domains) through a primary person ID (PubID, described in 5.2.3); a sketch of the SQL such a session might generate appears after Fig 7. The relevant public data can be reached conveniently and efficiently through a URL. The normalized data can be exported to Excel, ready for statistical analysis or for exchange with other data management systems.
Fig. 7.
Query interface: A) Data from a genetic epidemiology study in a cohort of Nigerians (detailed in case study); B) Data from domestic patients. The results of A and B are obtained by using different query filters and display options.
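For a flavor of what a QBE session compiles to, here is a hedged sketch against the illustrative views of section 5.2.3. The filter values echo Question 1 of the Introduction; the view names, the ICD-10 C50 code range, and the UGT1A1 gene symbol are our assumptions, not TraM’s generated SQL:

```sql
-- Patients diagnosed with breast cancer who have pathology/diagnosis
-- records and a given genotype, with any available samples and barcodes.
SELECT p.pub_id, d.diagnosis_code, g.genotype, s.barcode
FROM   deid.person_mv     p
JOIN   deid.diagnosis_mv  d ON d.pub_id = p.pub_id
JOIN   deid.genotype_mv   g ON g.pub_id = p.pub_id
LEFT JOIN deid.sample_mv  s ON s.pub_id = p.pub_id  -- optional display field
WHERE  d.diagnosis_code LIKE 'C50%'                 -- cancer "A" (assumed ICD-10)
AND    g.gene_symbol = 'UGT1A1';                    -- genetic profile "B" (assumed)
```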
5.4 Data translation toolkit
The TraM toolkit contains four kinds of utilities: 1) an online data dictionary for TraM; 2) a set of data templates in table format, with predefined domain concepts, for requesting source data; 3) a group of programs functioning as parsers and data model converters; and 4) a set of SQL scripts for data deployment. These utilities were developed for the typical data translation process (Fig 3 of section 4.3), and many of them can be customized for various data integration projects. The data translation tools are not designed for a specific application area; rather, they are built to manipulate certain data structures that can appear in many application areas. The functions of these tools are detailed in the Data integration workflow section (4.3).
6 Case Study
6.1 Data integration project overview
Data from 33,290 individuals under 22 IRB-approved protocols belonging to several translational research projects have been used to assess the TraM methods. The subjects of the research include patients with head and neck, lung, and breast cancers, as well as non-malignant diseases such as ataxia and irritable bowel syndrome. All these projects are multi-institutional or international collaborations. Each project has a different research focus, and the data generated from these projects are at different stages of progress. Table 2 outlines the domain data distributions, source data origins, and data quality descriptions (tokenization, standardization, and TE availability levels) of these projects. Despite the diversity of research subjects, we have not yet encountered a case that required us to create a project-specific entity to meet its data integration demands. All of these projects share the common logic described in Fig 1, differing only in their domain activity preferences. However, we have been constantly challenged by data heterogeneity from research domains across the country and the world. Common problems resulting from this heterogeneity can be summarized as inconsistencies in concept extraction, data modeling, and vocabulary used, and as discontinuity of data derived from the same cohort but collected in various research domains. In the following discussion, we use a real case to detail how the TraM system resolves these problems.
Table 2.
Snapshot of data collected from translational research projects¹
| Domains covered in TraM | Cancer Genetics | Breast SPORE | Head & Neck Cancer | Lung Cancer | IBS² | ATAXIA | CIHDR³ |
|---|---|---|---|---|---|---|---|
| Medical Survey | X (Genetic) | X (Epidemiology) | X (Environment) | X (Trial Follow Up) | X (Genetic) | X (Social) | |
| Demographics | X | X | X | X | X | X | X |
| Family Pedigree | X | X | X | ||||
| Physical Exam | X | X | |||||
| Clinical Lab Exam | X | X | |||||
| Imaging Exam | X | X | X | X | X | X | |
| Clinical Diagnosis | X | X | X | X | X | X | X |
| Cancer Staging | X | X | X | X | |||
| Metastasis | X | X | |||||
| Pathological Diagnosis | X | X | X | X | X | ||
| Clinical Treatment/Trial | X | X | X | X | X | X | |
| Medicine/Chemo | X | X | X | X | X | X | |
| Radiation | X | X | X | X | X | ||
| Surgical | X | X | X | X | X | ||
| Other | X | ||||||
| Response Evaluation | X | X | X | X | X | ||
| Follow Up | X | X | |||||
| Adverse Event | X | X | |||||
| Biospecimen | X | X | X | X | X | ||
| Biosample | X | X | X | X | |||
| Biomarker | X | X | X | ||||
| Basic Research | |||||||
| Genotyping | X | X | X | ||||
| Other | X | ||||||
| Data Quality⁴ | | | | | | | |
| Tokenization | +++ | ++ | ++ | ++ | + | +++ | ++ |
| Standardization | + | + | + | + | + | + | + |
| TE Exist | +++ | ++ | + | + | +++ | +++ | ++ |
| Data Source⁵ | | | | | | | |
| Geographic Location | Domestic/Foreign | Domestic/Foreign | Cross Institutes | Cross Institutes | Foreign | Cross Institutes | Domestic (Regional) |
| Involved SDBsᵃ | 31 | 10 | 7 | 2 | 2 | 2 | 4 |
| Storage Formsᵇ | DBMS/Excel/Paper | DBMS/Excel/Paper | Excel | Excel | Excel/Paper | ACCESS/Excel | DBMS/Excel/Paper |
¹ The data presented in this table do not represent the complete data of the research plans, nor do they imply all research activities of these projects: X = data present; empty = data absent.
² IBS (Irritable Bowel Syndrome).
³ CIHDR (Center for Interdisciplinary Health Disparities Research): this project shares a portion of the data from the Cancer Genetics account.
⁴ Quality scale: ranging from “+” (lowest) to “+++++” (highest).
ᵃ Some projects had already collected data from different sources using ACCESS or Excel; we count only these secondary data sources and do not track the number of original sources.
ᵇ DBMS indicates any of the following: Oracle, Sybase, MySQL, and ACCESS.
6.2 A use case
A genetic epidemiologic study of breast cancer is a typical translational research project that involves multiple research domains. In this study, data have been collected for more than 10 years from a field site in Ibadan, Nigeria. This data pool is only a small portion of the data in TraM’s “cancer genetics” account and is referred to as the NG data. Panel A of Table 3 summarizes the status of this data pool before integration. It contains 20 segregated datasets, in different storage forms, belonging to five research domains. Examples of these datasets include data from genetic epidemiologic surveys, specimen banking, clinical diagnoses, pathology reviews, and genotyping studies. Many records were generated in collaboration with scientists in Nigeria. Even though the quantity of domain data in this pool is not huge, it is still extremely time consuming and labor intensive for a researcher to track individuals’ data across the different source databases or datasets. The valuable information embedded in this data pool has therefore not yet been fully extracted and utilized. To extract useful information from the NG data effectively, we first needed to integrate these segregated datasets so that researchers can search them seamlessly across domains.
Table 3.
Comparison of Nigerian (NG) data before (A) and after (B) curation and integration¹
| Data fields | A: Persons | A: Entries | A: Storage forms | A: Datasets | A: Origin | B: Integrated² | B: Included³ | B: Recovered⁴ | B: Excluded⁵ |
|---|---|---|---|---|---|---|---|---|---|
| Person demographics | 1,577 | 1,577 | ACCESS; Excel; Paper | 4 | Nigeria | 1,577 | 1,577 | 0 | 0 |
| Case | 905 | 981 | Paper | 1 | Nigeria | 1,394 | 829 | 565 | 76 |
| Specimen | 210 | 1,588 | Excel; Paper | 3 | Nigeria | 1,383 | 209 | 1,174 | 1 |
| Sample | 1,460 | 1,447 | Excel; Paper | 2 | Nigeria | 1,379 | 1,367 | 12 | 93 |
| Epidemiology survey | 591 | 106,019 | ACCESS; Excel; Paper | 3 | Nigeria | 591 | 591 | 0 | 0 |
| Clinical diagnosis | 637 | 637 | Excel; Paper | 3 | Nigeria | 744 | 636 | 108 | 1 |
| Pathology diagnosis | 771 | 774 | Excel; Paper | 3 | United States | 771 | 771 | 0 | 3 |
| Genotype | 814 | 15,010 | Excel | 1 | United States | 744 | 744 | 0 | 70 |
¹ All the numbers in Panel B are normalized to person counts. Person counts listed in different data fields (rows) come from the same cohort and are aligned to the same persons across fields. For example, the 744 person count in the Genotype field indicates that 744 of the 1,577 persons who have demographics records also have genotype records.
² Integrated: person counts for each data field of the NG project in TraM, which are the sum of the person counts in the Included and Recovered columns.
³ Included: person counts of data from the original source files, which should be equal to or less than the counts in the Persons column of Panel A.
⁴ Recovered: person counts of data recovered during the curation and integration process.
⁵ Excluded: person counts of data disqualified for TraM from the source data files, i.e., the person counts in the Persons column of Panel A minus those in the Included column. These data lack any required sample donor information.
6.3 Data curation and translation
It took a physician, a biologist, and a bioinformatician two months to complete the NG data integration. The physician, who served as a curator, was responsible for verifying the clinical and pathology diagnostic data collected from Nigeria and for developing a questionnaire ontology for the survey data, which were stored in a locally developed ACCESS database. It took this curator a dedicated four-week period to build a genetic epidemiology survey questionnaire in the DO structure required by the TraM data model (Fig 2B). As a result, the original flat questionnaire structure and loosely defined survey responses were replaced by consistent vocabularies and systematic organization. At this stage, each new question was mapped to the old questions for data integration purposes.
The old survey data had been collected through multiple versions of the locally developed ACCESS database. They were all mapped to the newly defined question answers so as to correspond to the definitions of the questionnaire ontology. The same curator carried out this task, providing the mapping table between the new answers and the old ones.
Simultaneously, the biologist, who also served as a curator, took about two weeks to collect and verify the other domain data, which included bio-specimens, DNA samples, and genotype records.
The remaining procedures diagrammed in Fig 3 (section 4.3) were carried out computationally. Most scripts developed in this process are reusable with minor configuration, regardless of the source data’s origin. In particular, the scripts for converting and reassembling survey data from a non-ontology structure into a DO structure have been reused many times.
6.4 Integrated data
After integration, the integrity and continuity of the NG data were significantly improved. A detailed assessment of these data is summarized in Table 3, Panel B. All the numbers provided in Panel B were normalized to person counts for comparison with the data before curation and integration (Panel A). At this stage, the qualified NG data from the various domains are connected to each other, and the TraM curator GUI allows curators to improve the data further from this point.
Integrated data in TraM allow researchers to effectively answer important translational research questions, such as the sample questions posed in the Introduction. The information extracted from the TraM data shown in Fig 7 can now answer these questions; the answers corresponding to each question are indicated in parentheses in the following descriptions. Question 1 (Fig 7A): how many patients (“total record 506,” a number normalized to person counts) who were diagnosed with cancer “A” (breast cancer) and had pathology records available (data in the “histo_diagno” column) share a genetic profile “B” (alleles of the UDP-glucuronosyltransferase (UGT) gene in the genotype column)? Question 2 (Fig 7B): which patients who were diagnosed with a special cancer type “C” (ductal carcinoma) and had clinical treatment records “D” (chemo, surgical, and radiation therapies and their dates) share a distinct biomarker “E” (estrogen receptor positive) and have genetic pedigree maps (the progeny field holds pedigree identifiers)? Question 3 (Fig 7A and 7B): whether these patients have tissue or DNA samples available (specimen and sample columns) and where these samples can be obtained (the barcodes associated with the samples) for further studies. These results are normalized when exported in table format and are ready for statistical analysis.
7 Discussion
7.1 Lessons learned in practice
7.1.1 Duration of a data integration process and members of a data translation team
The duration of a data integration process varies depending upon the research domains involved, the data quality (e.g., homogeneity, standardization, and tokenization status), the knowledge and experience of the curators, and the thoroughness of the data integration plan. The quantity of data usually does not have much impact on the duration: homogeneous data integration can be fully automated even when the dataset is huge, as with genotype data. The ability to convert source data in heterogeneous structures into TraM-required configurations makes a significant difference in data translation efficiency. Handling this process demands a deep understanding of the nature and meaning of the data, as well as computational knowledge. An ideal data translation team includes a dedicated biomedical informatician (generic across all projects) and one or two curators (project specific, not necessarily full time), depending on the knowledge scope required. With such a team, we were able to accomplish batch data integration for the projects described in Table 2 within time spans of a few weeks to four months. It is worth noting that this calculation is based upon the first round of data integration. Later updates can be much more efficient, since the required domain ontology has been established and the data translation tools can be reused.
7.1.2 Recognize limitations of current technology and accept the curator role
Human intervention is needed to correct and improve translational data, yet the role of the curator is often underestimated in grant applications for both informatics and translational research. It is unreasonable to expect that a substantial amount of free-text data in clinical SDBs can be extracted, tokenized, and standardized solely by a smart “natural language processing” tool. For example, a pure computational process (without a sophisticated semantic mapping database) cannot determine that the text “CHOP” in a chemotherapy free-text field stands for “ADRIAMYCIN, CYCLOPHOSPHAMIDE, PREDNISONE, VINCRISTINE” or that “MVAC” stands for “METHOTREXATE, VINBLASTINE, ADRIAMYCIN, CISPLATIN.” In our experience, it took a curator seven weeks of intensive effort to interpret 11,075 distinct regimen records like these. After these records were standardized and (computationally) sorted in alphabetical order, we determined that the 11,075 distinct regimen records actually represent only 206 types of chemotherapy drug combinations. This is a typical example of how a curator plays a critical role in a data integration process; a sketch of how such curated knowledge is applied follows.
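Once compiled by the curator, such regimen knowledge can be applied mechanically on later passes. A minimal sketch, with a hypothetical staging table `chemo_stage` and a mapping table whose contents come entirely from the curator:

```sql
-- Curator-built mapping from free-text acronyms to standardized drug lists.
CREATE TABLE regimen_map (
  raw_text  VARCHAR2(100) PRIMARY KEY,
  std_drugs VARCHAR2(400) NOT NULL
);
INSERT INTO regimen_map VALUES
  ('CHOP', 'ADRIAMYCIN, CYCLOPHOSPHAMIDE, PREDNISONE, VINCRISTINE');
INSERT INTO regimen_map VALUES
  ('MVAC', 'METHOTREXATE, VINBLASTINE, ADRIAMYCIN, CISPLATIN');

-- Apply the mapping during data translation (chemo_stage is hypothetical).
UPDATE chemo_stage c
SET    c.regimen = (SELECT m.std_drugs
                    FROM   regimen_map m
                    WHERE  m.raw_text = UPPER(TRIM(c.regimen)))
WHERE  EXISTS (SELECT 1
               FROM   regimen_map m
               WHERE  m.raw_text = UPPER(TRIM(c.regimen)));
```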
Curators are typically responsible for six tasks. 1) Identifying the required source data: a curator should be clear about which domain data are needed for a translational research plan. 2) Locating the data sources: each translational research project has its preferred collaborators, so its data sources can be in different institutions across the country or around the world; a data integration system such as TraM cannot predict or directly interact with those SDBs. 3) Recovering required missing data: in reality, the majority of disparate domain SDBs do not recruit data for a particular population in a synchronized manner, nor are they necessarily designed to keep TEs for other research domains, so data from different SDBs often cannot be integrated without human intervention. 4) Verifying contradictory data: data from different sources, or from different versions of the same source, may conflict with each other in either expression or substance; to ensure the reliability of records, curators need to resolve such conflicts to the best of their knowledge. 5) Improving data after integration: typically, data still need improvement even after integration, since some problems are not detectable while data are segregated but are clearly revealed once the data are integrated. 6) Collaborating with programmers on process automation.
7.2 Applications
The performance of the TraM system is independent of the SDB environment, since source data, not SDB interoperability, are the only requirement for its functions. The system is best suited to research that spans the bidirectional continuum between the bench and the bedside. A unique aspect of this system is the integration of basic science investigation with clinical trial data as well as medical (epidemiologic, genetic, environmental, and social impact) field work questionnaires. In addition, it can potentially be applied to a number of other purposes, outlined below:
7.2.1 Monitor and manage research activity and plan for new studies
Data availability reflects research progress and activity. Being able to monitor research activity through a single computational platform will modernize research management and planning. Integrated, continued, and verified biomedical data associated with individuals allow researchers to more effectively discover underlying scientific mechanisms and propose new aims or adjust methodologies. In this sense, the TraM system may not only be a data integration tool, but also a tool that facilitates data mining and research planning.
7.2.2 Transform raw data into reliable knowledge
The knowledge that translational research reveals is precious and evolving. Data always need to be revisited and updated when a new discovery or interpretation emerges. With continuous curation and enrichment of information, the reliable knowledge that the data represent becomes invaluable. Examples of the transformation from a raw data repository into a knowledgebase include GenBank to Entrez Gene and PIR to UniProt; in both cases, behind-the-scenes curation has played an important role. The TraM system provides full-scale curation functions, so it can potentially be useful for constructing a translational knowledgebase.
7.2.3 Contribute to and benefit from a service-oriented computation network
The TraM system, once a service interface is enabled, can be a key node from which to share high-quality translational data on an interoperable data grid. Conversely, it can also benefit from such a service-oriented grid to increase the automation of its data integration process.
7.3 Limitations of the TraM approach
7.3.1 System adoption relies on the adopter’s data translation ability
We chose a data warehouse architecture to build the TraM system for our intended goals, and the system has been working as expected. Although the performance of the system is independent of the SDB environment, it still relies on the informatics expertise of the system adopters to translate data from the sources into TraM. This dependency may limit TraM’s adoption across institutions. We are in the process of improving the TraM data translation toolkit and will eventually make the tools available to TraM adopters. We expect the toolkit to improve the system’s adaptability, but we also realize that the toolkit does not provide plug-and-play functions; informatics effort is needed to customize the data translation tools for each set of special source data.
7.3.2 The ability to create a domain ontology may affect the clarity of data concepts
Integrating DO structures into a typical ERM significantly boosts data capture flexibility, but it may also increase the chances of recruiting inconsistent vocabulary and generating concept/data redundancy (when the same concept is described by different vocabularies, or vice versa). Even though only experienced and knowledgeable curators are allowed to build concepts for a DO, domain experts and a nomenclature committee are eventually required to align and improve the contents of the DOs. The advantage of TraM is that the data can easily be aligned to a new DO structure without going through the data integration process again.
7.3.3 Traceability from the TraM data to the source data
A physical copy of the data in the TraM system allows data curation and supports excellent query performance, but it may also be a liability for the system. In addition to the lack of real-time information, the integrated data, after being cleansed, verified, and standardized, may look or actually be different from the source data. To minimize this problem, TraM maintains a curation log and permanently stores the mapping records between the TraM PubID and the HIPAA-protected identifiers behind a firewall to preserve data traceability. However, since the source data in their original format are mostly inaccessible due to HIPAA regulations, regular TraM users are not able to view the original data as they exist through hyperlinks, a convenience that most molecular biology databases provide.
7.3.4 Not suitable for real-time data demands or domain-specific operations
Although the TraM system is intended to support a wide range of translational research, it is not recommended for those who need real-time data or for investigators working within a single domain, since the system does not directly interoperate with SDBs and largely simplifies domain operational details. Despite this, our streamlined data coverage approach is well matched to the needs of the translational researchers we have informally surveyed.
8 Conclusion
This paper introduces a curation-enabled, warehouse-based data integration strategy for translational research. It appears to be a functioning and sustainable approach for translational researchers to manage and utilize their data. In the years to come, currently disconnected and non-transparent biomedical SDBs may undergo an interoperable data sharing process, and users may eventually be able to view various domain data from a unified platform without migrating the data. However, unless there is a mechanism to regulate the constituent SDBs so that they actively recruit data for a particular translational research plan in a synchronized manner, discontinuity, redundancy, and missing required data elements and contents for targeted individuals will likely still occur. Therefore, within a service-enabled computational environment, our system can work even more efficiently to ensure high-quality personalized translational data integration.
Acknowledgments
This work was supported by grants 1-P50-CA125183 and 1-R01-CA89085-01A1 to Olufunmilayo I Olopade and by the Ralph and Marion Falk Medical Research Trust. We thank Ian Foster for his thoughtful comments; Dezheng Huo, David Malaka, and Andrey Khramtsov for domain data collection; Michelle Porcellino for editing; and Greg Cross for system support.
References
1. Guttmacher AE, Collins FS. Genomic medicine--a primer. N Engl J Med. 2002 Nov 7;347(19):1512–20. doi: 10.1056/NEJMra012240.
2. Khoury MJ, McCabe LL, McCabe ER. Population screening in the age of genomic medicine. N Engl J Med. 2003 Jan 2;348(1):50–8. doi: 10.1056/NEJMra013182.
3. Alesci S, Chrousos GP, Pacak K. Genomic medicine: exploring the basis of a new approach to endocrine hypertension. Ann N Y Acad Sci. 2002 Sep;970:177–92. doi: 10.1111/j.1749-6632.2002.tb04424.x.
4. Hopkins MM. Putting pharmacogenetics into practice. Nat Biotechnol. 2006 Apr;24(4):403–10. doi: 10.1038/nbt0406-403.
5. Collins I, Workman P. New approaches to molecular cancer therapeutics. Nat Chem Biol. 2006 Dec;2(12):689–700. doi: 10.1038/nchembio840.
6. Evans WE, Relling MV. Moving towards individualized medicine with pharmacogenomics. Nature. 2004 May 27;429(6990):464–8. doi: 10.1038/nature02626.
7. Webb CP, Pass HI. Translation research: from accurate diagnosis to appropriate treatment. J Transl Med. 2004 Oct 21;2(1):35. doi: 10.1186/1479-5876-2-35.
8. Sanchez-Serrano I. Success in translational research: lessons from the development of bortezomib. Nat Rev Drug Discov. 2006 Feb;5(2):107–14. doi: 10.1038/nrd1959.
9. Jain KK. Challenges of drug discovery for personalized medicine. Curr Opin Mol Ther. 2006 Dec;8(6):487–92.
10. Geddes D. Translational research--from gene to treatment: lessons from cystic fibrosis. Clin Med. 2005 May-Jun;5(3):258–63. doi: 10.7861/clinmedicine.5-3-258.
11. Gaughan A. Bridging the divide: the need for translational informatics. Pharmacogenomics. 2006 Jan;7(1):117–22. doi: 10.2217/14622416.7.1.117.
12. Horig H, Marincola E, Marincola FM. Obstacles and opportunities in translational research. Nat Med. 2005 Jul;11(7):705–8. doi: 10.1038/nm0705-705.
13. Sabroe I, Dockrell DH, Vogel SN, Renshaw SA, Whyte MK, Dower SK. Identifying and hurdling obstacles to translational research. Nat Rev Immunol. 2007 Jan;7(1):77–82. doi: 10.1038/nri1999.
14. Payne PR, Johnson SB, Starren JB, Tilson HH, Dowdy D. Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med. 2005 May;53(4):192–200. doi: 10.2310/6650.2005.00402.
15. Yu AC. Methods in biomedical ontology. J Biomed Inform. 2006 Jun;39(3):252–66. doi: 10.1016/j.jbi.2005.11.006.
16. Chen PP-S. The entity-relationship model--toward a unified view of data. ACM Transactions on Database Systems. 1976 Mar;1(1):9–36.
17. http://en.wikipedia.org/wiki/Entity-relationship_diagram
18. Knudsen TB, Daston GP; Teratology Society. MIAME guidelines. Reprod Toxicol. 2005 Jan-Feb;19(3):263. doi: 10.1016/j.reprotox.2004.10.004.
19. Sujansky W. Heterogeneous database integration in biomedicine. J Biomed Inform. 2001 Aug;34(4):285–98. doi: 10.1006/jbin.2001.1024.
20. Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P. Data integration and genomic medicine. J Biomed Inform. 2007 Feb;40(1):5–16. doi: 10.1016/j.jbi.2006.02.007.
21. Turisco F, Keogh D, Stubbs C, Glaser J, Crowley WF Jr. Current status of integrating information technologies into the clinical research enterprise within US academic health centers: strategic value and opportunities for investment. J Investig Med. 2005 Dec;53(8):425–33. doi: 10.2310/6650.2005.53806.
22. Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Marshall MS, Ogbuji C, Rees J, Stephens S, Wong GT, Wu E, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web. BMC Bioinformatics. 2007 May 9;8(Suppl 3):S2. doi: 10.1186/1471-2105-8-S3-S2.
23. Alonso-Calvo R, Maojo V, Billhardt H, Martin-Sanchez F, Garcia-Remesal M, Perez-Rey D. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J Biomed Inform. 2007 Feb;40(1):17–29. doi: 10.1016/j.jbi.2006.02.014.
24. Arbona A, Benkner S, Engelbrecht G, Fingberg J, Hofmann M, Kumpf K, Lonsdale G, Woehrer A. A service-oriented grid infrastructure for biomedical data and compute services. IEEE Trans Nanobioscience. 2007 Jun;6(2):136–41. doi: 10.1109/tnb.2007.897438.
25. Maojo V, Crespo J, de la Calle G, Barreiro J, Garcia-Remesal M. Using web services for linking genomic data to medical information systems. Methods Inf Med. 2007;46(4):484–92. doi: 10.1160/me9056.
26. Komatsoulis GA, Warzel DB, Hartel FW, Shanbhag K, Chilukuri R, Fragoso G, Coronado S, Reeves DM, Hadfield JB, Ludet C, Covitz PA. caCORE version 3: implementation of a model driven, service-oriented architecture for semantic interoperability. J Biomed Inform. 2008 Feb;41(1):106–23. doi: 10.1016/j.jbi.2007.03.009.
27. Angulo C, Crespo P, Maldonado JA, Moner D, Perez D, Abad I, Mandingorra J, Robles M. Non-invasive lightweight integration engine for building EHR from autonomous distributed systems. Stud Health Technol Inform. 2006;124:173–8.
28. Karthikeyan M, Krishnan S, Pandey AK, Bender A. Harvesting chemical information from the Internet using a distributed approach: ChemXtreme. J Chem Inf Model. 2006 Mar-Apr;46(2):452–61. doi: 10.1021/ci050329+.
29. Martone ME, Gupta A, Ellisman MH. E-neuroscience: challenges and triumphs in integrating distributed data from molecules to brains. Nat Neurosci. 2004 May;7(5):467–72. doi: 10.1038/nn1229.
30. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007 May 15;23(10):1282–8. doi: 10.1093/bioinformatics/btm098.
31. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007 Jan;35(Database issue):D26–31. doi: 10.1093/nar/gkl993.
32. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000 Jan 1;28(1):27–30. doi: 10.1093/nar/28.1.27.
33. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007 Jan;35(Database issue):D61–5. doi: 10.1093/nar/gkl842.
34. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D514–7. doi: 10.1093/nar/gki033.
35. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007 Feb;40(1):30–43. doi: 10.1016/j.jbi.2006.02.013.
36. Brazhnik O, Jones JF. Anatomy of data integration. J Biomed Inform. 2007 Jun;40(3):252–69. doi: 10.1016/j.jbi.2006.09.001.
37. Burek P, Hoehndorf R, Loebe F, Visagie J, Herre H, Kelso J. A top-level ontology of functions and its application in the Open Biomedical Ontologies. Bioinformatics. 2006 Jul 15;22(14):e66–73. doi: 10.1093/bioinformatics/btl266.
38. Malin B, Sweeney L. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J Biomed Inform. 2004 Jun;37(3):179–92. doi: 10.1016/j.jbi.2004.04.005.
39. Richardson DB. The impact on relative risk estimates of inconsistencies between ICD-9 and ICD-10. Occup Environ Med. 2006 Nov;63(11):734–40. doi: 10.1136/oem.2006.027243.
40. Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in SNOMED CT: an exploration into large description logic-based biomedical terminologies. Artif Intell Med. 2007 Mar;39(3):183–95. doi: 10.1016/j.artmed.2006.12.003.
41. Bodenreider O. Circular hierarchical relationships in the UMLS: etiology, diagnosis, treatment, complications and prevention. Proc AMIA Symp. 2001:57–61.
42. Mougin F, Burgun A, Bodenreider O. Using WordNet to improve the mapping of data elements to UMLS for data sources integration. AMIA Annu Symp Proc. 2006:574–8.
43. Fung KW, Hole WT, Nelson SJ, Srinivasan S, Powell T, Roth L. Integrating SNOMED CT into the UMLS: an exploration of different views of synonymy and quality of editing. J Am Med Inform Assoc. 2005 Jul-Aug;12(4):486–94. doi: 10.1197/jamia.M1767.
44. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267–70. doi: 10.1093/nar/gkh061.
45. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308–11. doi: 10.1093/nar/29.1.308.
46. Zloof M. Query by Example: the invocation and definition of tables and forms. Proceedings of the 1st International Conference on Very Large Data Bases; 1975. pp. 1–24.
47. http://en.wikipedia.org/wiki/Query_by_Example