Harmonization and Semantic Annotation of Data Dictionaries from the Pharmacogenomics Research Network: a case study

Qian Zhu; Robert R Freimuth; Zonghui Lian; Scott Bauer; Jyotishman Pathak; Cui Tao; Matthew J Durski; Christopher G Chute

doi:10.1016/j.jbi.2012.11.004

Author manuscript; available in PMC: 2014 Apr 1.

Published in final edited form as: J Biomed Inform. 2012 Nov 29;46(2):286–293. doi: 10.1016/j.jbi.2012.11.004

PMCID: PMC3606279 NIHMSID: NIHMS425586 PMID: 23201637

Abstract

The Pharmacogenomics Research Network (PGRN) is a collaborative partnership of research groups funded by NIH to discover and understand how the genome contributes to an individual’s response to medication. Because traditional biomedical research studies and clinical trials are often conducted independently, common and standardized representations for data are seldom used. This leads to heterogeneity in data representation, which hinders data reuse, data integration and meta-analyses.

This study demonstrates harmonization and semantic annotation of pharmacogenomics data dictionaries collected from PGRN research groups. A semi-automated system was developed to support the harmonization/annotation process, which comprised four steps applied to a total of 1514 PGRN variables: 1) pre-processing PGRN variables; 2) decomposing and normalizing variable descriptions; 3) semantically annotating words and phrases using controlled terminologies; 4) grouping PGRN variables into categories based on the annotation results and semantic types.

Our results demonstrate that there is a significant amount of variability in how pharmacogenomics data is represented and that additional standardization efforts are needed. This represents a critical first step toward identifying and creating data standards for pharmacogenomics studies.

Keywords: Data harmonization, semantic annotation, Pharmacogenomics

1. Introduction

As biomedical research becomes more collaborative, the challenges that arise when exchanging data among research groups become more pronounced. One of the primary, and most fundamental, challenges in exchanging and integrating data is to ensure that data is both semantically (i.e., variable names and values share common meanings) and syntactically (i.e., the data shares a common format) interoperable. Incompatibilities often arise as a result of differences in the way research groups define and represent data. Overcoming these barriers usually requires one-to-one mappings and transformations between data sets. A more scalable approach is to define and use data standards, which ensure that all data collected using the standards share both the same semantic meaning and the same syntactic representation. Such standards, however, can be difficult to define in rapidly evolving fields of study where the types of data and/or the relationships between them change frequently. In those cases, standardization usually occurs after a sufficiently large corpus of data has been collected and research methods begin to converge. This manuscript describes results from the first step of just such a standardization process. Specifically, we describe the results of a case study of data dictionary standardization from members of the Pharmacogenomics Research Network (PGRN) [1].

PGRN is a collaborative partnership of research groups funded by the U.S. National Institutes of Health to discover and understand how the genome contributes to an individual’s response to medication. PGRN sites conduct research across a very broad range of fields, from cardiovascular and pulmonary diseases (including arrhythmias, hypertension, hypercholesterolemia, and asthma), cancers (including breast and gastrointestinal tumors and childhood leukemias), and neuropsychiatric disorders (including depression and addiction) to classic determinants of drug blood levels (pathways of absorption, distribution, metabolism, elimination, and transport) [2]. More than 1,000 published fundamental and clinical research studies have contributed significantly to the scientific base of knowledge in pharmacogenomics [3,4], a trend that is expected to continue. However, traditional biomedical research studies and clinical trials are conducted independently, and common and standardized representations for data are seldom used. This leads to heterogeneity in the collected data, which hinders data reuse, integration and meta-analyses across multiple datasets.

2. Motivation

The variety of disease phenotypes studied in the PGRN, together with differences in the clinical systems in use at each PGRN site, leads to data that is heterogeneous, non-standardized, and institution-specific. This not only hinders data aggregation among collaborating sites on a given study, but also complicates or prevents secondary use of the data, e.g., in meta-analyses.

To help overcome these issues, we performed a survey of PGRN data dictionaries, which are repositories of information about the data collected for a given study. Data dictionaries describe the variables used to capture data, including their meaning, origin, usage, relationships to other variables, and format. The goals of this survey were to: 1) identify overlapping and non-overlapping variables in the PGRN data dictionaries and 2) propose standards that establish a common semantic meaning and syntactic representation for the data.

For example, Table 1 lists several variables, along with their definitions and permissible values, from the data dictionaries of two PGRN sites. All three fields exhibit considerable variation as a result of both intra- and inter-site differences. As an example of intra-site inconsistency, Site 1 defines two different variables to capture the ethnicity of a subject’s maternal grandmother, and these have different names, definitions, and permissible values. Interestingly, although the meaning of the permissible values is the same for the two variables, the representation of the data is different, i.e., one variable uses integers while the other uses text. Inter-site differences between Sites 1 and 2 are also evident, as different names and permissible values are used to define the same concept. Furthermore, and perhaps most significantly, while the name and description of the variables defined by Site 1 indicate that the data represents ethnicity, the values are an admixture of both ethnic and racial categories. This results in a discrepancy between the variable name/description and the list of values, which will complicate interpretation and integration of the data.

Table 1.

Example of heterogeneity in data dictionaries: representation of race and ethnicity

Origin	Variable Name	Variable Description	Permissible Values
PGRN Site 1	race_matern_gm	ethnic background of your biological maternal grandmother	−8 = Not Applicable, −1 = Unknown, 1 = Caucasian (White), 2 = African American, 3 = Hispanic, 4 = Asian, 5 = Native American, 6 = Other
PGRN Site 1	mat_gm_eth	maternal grandmothers ethnicity	white, black, hispanic, native american, asian, unknown, other, not applicable
PGRN Site 2	Race	(none provided)	1 = American Indian or Alaska Native, 2 = Asian, 3 = Black or African American, 4 = Native Hawaiian or Pacific Islander, 5 = White, 6 = Unknown
PGRN Site 2	Ethnicity	(none provided)	1 = Hispanic or Latino, 2 = Not Hispanic or Latino, 3 = Unknown
OMB	Race	OMB race category (minimum designations)	American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Pacific Islander, White
OMB	Ethnicity	OMB ethnicity category (minimum designations)	Hispanic or Latino, Not Hispanic or Latino

Concepts of race and ethnicity are distinct and well-defined. In addition, the U.S. Office of Management and Budget (OMB) established standards for reporting race and ethnicity information that are already widely used [5] (Table 1). PGRN Site 2 conforms to the OMB standard but it employs a custom coding scheme and it lacks explicit definitions for the variables. This example illustrates how data consistency and comparability would be improved if both PGRN sites used the same definition and representation for common concepts. While this is only a simple example, it is common to find similar issues with other variables. In general, we have found that data heterogeneity tends to increase with the complexity of the data, the degree to which local coding systems are used, and the level of informality of the data dictionary. The harmonization effort described in this case study represents a critical first step toward identifying and creating data standards for pharmacogenomics studies.

3. Materials and Methods

In this paper we demonstrate our approach to harmonizing the data dictionaries of the PGRN, which is a highly diverse research network. The approach emphasizes semantically annotating PGRN variables using existing controlled terminologies, where possible, to avoid unnecessary proliferation of proposed standards in the biomedical research community. As shown in Figure 1, we accomplished this task in four steps: 1) pre-processing PGRN variables; 2) decomposing and normalizing variable descriptions; 3) semantically annotating words and phrases using controlled terminologies; 4) grouping PGRN variables into categories based on annotation results and semantic types.

3.1. Data Pre-Processing

Data dictionaries were collected from PGRN research sites. To accommodate differences in format (PDF, plain text, Microsoft Excel spreadsheets, HTML, etc.) and in the granularity of information provided for each variable, we pre-processed each dictionary by reformatting it and filling in missing data, such as missing variable descriptions or value sets. Discussions were held with the dictionary owner to obtain or clarify variable descriptions and value set contents, and to define abbreviations. All variables were loaded into a MySQL database for harmonization. Each variable was assigned a unique identifier that was used throughout the entire harmonization process.

3.2 Decomposition and Normalization

To provide consistent and comparable definitions for variables across research sites, terms from controlled terminologies were used to capture the semantic meaning of variable descriptions. As described below, NCBO BioPortal services were used to identify candidate terms. While the BioPortal service is designed to return both exact and partial matches, it is not designed to take long phrases, such as those typically found in data dictionaries, as input. For example, no annotation results were retrieved using the whole phrase “Was the patient hospitalized for heart failure”, even after stop words (“was”, “the”, “for”) were removed. Therefore, variable descriptions were decomposed and normalized for querying, using an approach based on a lexical search algorithm. This approach first split each description into single words and short phrases, then removed stop words and normalized word forms.

Decomposition

Variable descriptions were first split into single words, which were then reassembled into phrases. The words and phrases, which we termed “mapping components” (MCs), were ultimately used as query terms for the Bioportal service. For instance, a description containing three words (“A B C”) will produce seven MCs (A, B, C, AB, BC, AC, ABC). The length of each phrase was limited to a maximum of six single words.
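The decomposition step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `decompose` and its parameters are hypothetical. By default it keeps every order-preserving word combination, which reproduces the seven MCs of the “A B C” example; because the phrases listed in Table 2 are all made of adjacent words, a `contiguous` switch is included as an assumed alternative reading.

```python
from itertools import combinations


def decompose(description, max_words=6, contiguous=False):
    """Return mapping components (MCs) for a variable description.

    contiguous=False keeps every order-preserving word combination
    (so "A B C" also yields "A C"); contiguous=True keeps only phrases
    made of adjacent words. Phrases never exceed max_words words.
    """
    words = description.split()
    mcs = []
    for size in range(1, min(max_words, len(words)) + 1):
        if contiguous:
            # Sliding windows of adjacent word positions.
            spans = [tuple(range(i, i + size))
                     for i in range(len(words) - size + 1)]
        else:
            # All increasing position tuples of the given size.
            spans = combinations(range(len(words)), size)
        for positions in spans:
            mcs.append(" ".join(words[i] for i in positions))
    return mcs


print(decompose("A B C"))
```

Both modes grow quickly with description length, which is one practical reason for capping phrases at six words.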

Stop word removal

Many words in variable descriptions carry no meaning for semantic annotation. To improve the results of the BioPortal queries, we removed all words that appeared in a stop word list [6] or a common English word list [7]. We also removed MCs in which at least 50% of the words were stop words.
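A minimal sketch of the 50% rule is shown below. The three-word `STOP_WORDS` set is a hypothetical stand-in for the stop word and common English word lists cited above, and the function names are illustrative only.

```python
# Hypothetical stand-in for the stop word lists cited as [6] and [7].
STOP_WORDS = {"was", "the", "for"}


def stop_word_fraction(mc, stop_words=STOP_WORDS):
    """Share of the words in an MC that are stop words."""
    words = mc.lower().split()
    if not words:
        return 1.0  # treat an empty MC as pure noise
    return sum(w in stop_words for w in words) / len(words)


def filter_mcs(mcs, stop_words=STOP_WORDS, threshold=0.5):
    """Keep only MCs whose stop-word share is below the threshold."""
    return [mc for mc in mcs
            if stop_word_fraction(mc, stop_words) < threshold]


candidates = ["was", "the patient", "for heart", "heart failure",
              "patient hospitalization for"]
print(filter_mcs(candidates))
```

Under this rule a single stop word is dropped as well, since its stop-word share is 100%, while a three-word phrase with one stop word (about 33%) is kept.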

Normalization

The level of formalism in data dictionaries varies greatly. To reduce colloquialism in variable definitions, part-of-speech conversion and tense normalization were implemented based on the Unified Medical Language System (UMLS) Specialist Lexicon [8]. This process converted verb tenses to a common base form, plural nouns to singular form, and possessive nouns to base forms using the LRAGR lexicon. In addition, verbs, adjectives, and adverbs were converted to nouns using the LRNOM lexicon.

Table 2 shows an example of a variable description that was decomposed and normalized. In this example, “was”, “the”, and “for” were removed as stop words, and phrases such as “was the”, “the patient”, “hospitalization for”, “for heart”, and “was the patient” were removed because their percentage of stop words met or exceeded 50%. In addition, “hospitalized” was converted to “hospitalize” by LRAGR, and then to “hospitalization” by LRNOM.

Table 2.

Example of variable description decomposition and normalization

Original variable description	MC type	Resulting mapping components (MCs)
Was the patient hospitalized for heart failure	Single words	patient, hospitalization, heart, failure
Was the patient hospitalized for heart failure	Phrases	patient hospitalization, heart failure, patient hospitalization for, hospitalization for heart, for heart failure, patient hospitalization for heart, hospitalization for heart failure, patient hospitalization for heart failure
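The two-stage normalization in this example (inflected form to base form, then base form to noun) can be sketched with plain lookups. The dictionaries below are tiny hypothetical stand-ins for LRAGR and LRNOM; they do not reflect the layout or coverage of the real Specialist Lexicon files. Lower-casing the input is an added assumption.

```python
# Hypothetical miniature stand-ins for the two lexicon resources:
# LRAGR_BASE maps an inflected form to its base form,
# LRNOM_NOUN maps a base form to its nominalization.
LRAGR_BASE = {"hospitalized": "hospitalize", "patients": "patient"}
LRNOM_NOUN = {"hospitalize": "hospitalization"}


def normalize_word(word):
    """Reduce a word to its base form, then to a noun if one is listed."""
    base = LRAGR_BASE.get(word.lower(), word.lower())
    return LRNOM_NOUN.get(base, base)


def normalize_mc(mc):
    """Normalize every word of a mapping component."""
    return " ".join(normalize_word(w) for w in mc.split())


print(normalize_mc("patient hospitalized"))
```

Words missing from both lookups pass through unchanged, so the step only rewrites forms that the lexicon actually covers.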

3.3 Semantic annotation and categorization

To complete the semantic annotation process, the MCs generated in the previous step were used to query controlled terminologies, the results were reviewed manually, and the UMLS semantic types (ST) [9] of the selected terms were used to group variables into different categories.

Annotation with controlled terminologies

Based on the types of data collected in the pharmacogenomics domain, SNOMED-CT [10], NDF-RT [11], NCI Thesaurus [12], RxNorm [13] and LOINC [14] were selected as source terminologies for semantic annotation. NCBO BioPortal [15] provides access to many biomedical ontologies, including those selected for this study. An annotation pipeline was developed to utilize BioPortal Web services [16], which provide programmatic access to terminology content. This pipeline used the MCs obtained above to query the five terminologies selected for this study. Query results were returned in XML format and loaded into a database for manual review.
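The overall shape of such a pipeline, with every MC sent to each of the five terminologies and the candidate terms collected as rows for later review, might look like the sketch below. It deliberately does not call BioPortal: `search` is a caller-supplied function, and `fake_search` is a hypothetical offline stub whose two hits reuse codes listed in Table 7. The row layout is likewise an assumption, not the authors' database schema.

```python
TERMINOLOGIES = ["SNOMED-CT", "NDF-RT", "NCI Thesaurus", "RxNorm", "LOINC"]


def annotate(variable_id, mcs, search):
    """Query every terminology with every MC and collect candidate rows.

    `search(mc, terminology)` is supplied by the caller and must return
    a list of (concept_code, preferred_name) pairs.
    """
    rows = []
    for mc in mcs:
        for terminology in TERMINOLOGIES:
            for code, name in search(mc, terminology):
                rows.append({
                    "variable_id": variable_id,
                    "mc": mc,
                    "terminology": terminology,
                    "code": code,
                    "preferred_name": name,
                })
    return rows


def fake_search(mc, terminology):
    """Offline stub standing in for the terminology web service."""
    hits = {
        ("aspirin", "RxNorm"): [("1191", "aspirin")],
        ("myocardial infarction", "SNOMED-CT"):
            [("22298006", "Myocardial Infarction")],
    }
    return hits.get((mc, terminology), [])


rows = annotate("var-1", ["aspirin", "myocardial infarction", "history"],
                fake_search)
print(len(rows))
```

Keeping the variable identifier on every row lets a reviewer see all candidate terms for one variable together, regardless of which MC or terminology produced them.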

Annotation review

Annotation results were manually reviewed to ensure that the semantic meaning of each corresponding variable description was captured. To facilitate this review process, a simple web application was developed that allowed curators to select the best term(s) for annotation (Figure 2). The web application presented all of the terms that were returned for a given variable, using the variable’s MCs as query terms. Curators reviewed each variable description and selected the term(s) that were thought to best represent the semantic meaning of the variable, as indicated by the check box in the “Accepted Mapping” column in Figure 2.

Figure 2. Snapshot of the “Variable Mapping Viewer” web interface

Following term selection, curators determined how completely the selected terms captured the semantic meaning of the variable. Each variable was given a status of “complete mapping”, “partial mapping” or “no mapping”. Variables with a status of “complete mapping” were used directly in the next step, variable categorization, while variables that were not sufficiently represented by the query results were flagged for further study, e.g., additional clarification of the semantic meaning with the owner of the data dictionary or manual annotation.

Categorization

To facilitate the harmonization process, variables were categorized into common domains, such as demographics, medications, and laboratory results. This was accomplished by taking advantage of mappings that exist between the terminologies used for semantic annotation and UMLS semantic types (ST). UMLS STs are organized in a hierarchical tree. As shown in Figure 3, “Disease or Syndrome” is a child node of “Pathologic Function” and a parent node of “Mental or Behavioral Dysfunction”. The ST hierarchy also allowed us to uniformly represent annotations made at different levels of granularity, which is likely to happen when different terminologies are used for annotation. For example, “Atrial Fibrillation” is a “Disease or Syndrome” in NCI Thesaurus but a “Pathologic Function” in SNOMED-CT and NDF-RT. The ST hierarchy provides a means to identify a common category (“Pathologic Function”) for “Atrial Fibrillation” across all three terminologies.

Figure 3. Subset of the UMLS semantic types hierarchical tree
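Finding a common category in the ST hierarchy amounts to walking up the tree until a shared node is reached. The sketch below encodes only the three semantic types named in the text; the function names are hypothetical and the real hierarchy is far larger.

```python
# Child -> parent links for the three semantic types named in the text.
ST_PARENT = {
    "Mental or Behavioral Dysfunction": "Disease or Syndrome",
    "Disease or Syndrome": "Pathologic Function",
}


def ancestors(st):
    """Return the semantic type followed by its ancestors, nearest first."""
    chain = [st]
    while chain[-1] in ST_PARENT:
        chain.append(ST_PARENT[chain[-1]])
    return chain


def common_category(semantic_types):
    """Return the most specific ST shared by all inputs, or None."""
    chains = [ancestors(st) for st in semantic_types]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None


# "Atrial Fibrillation" as typed in NCI Thesaurus, SNOMED-CT and NDF-RT.
print(common_category(["Disease or Syndrome",
                       "Pathologic Function",
                       "Pathologic Function"]))
```

For the “Atrial Fibrillation” example this returns “Pathologic Function”, the node that covers the annotations from all three terminologies.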

Several domains were chosen as variable categories and mapped to ST categories (Table 3), based on the types of variables that were present in the data dictionaries used for this study. The ST of the primary concept, which was determined by manual review of the semantic annotations, was then used to categorize the variable into one of the domains. For example, for “past angina”, the primary concept “angina”, which has a disorder ST, is used for categorization, and “past” serves as a temporal qualifier for “angina”.

Table 3.

Grouping UMLS Semantic Types into proposed domains

Domains	Relevant Semantic Types
Demographic	Organism Attribute; Organism Function
Medication	Pharmacologic Substance; Clinical Drug; Organic Chemical
Laboratory	Laboratory or Test Result; Laboratory Procedure
Disorder	Disease or Syndrome; Mental or Behavioral Dysfunction; Pathologic Function
Smoking Status	Environmental Effect of Humans
Clinical Observation	Clinical Attribute
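Table 3 can be read as a lookup from the semantic type of a variable's primary concept to a domain. The sketch below inverts the table into such a lookup; the fallback label “Other categories” is borrowed from Table 7, and the function name is hypothetical. In the study the primary concept was chosen by manual review, which this sketch does not attempt to automate.

```python
# Table 3: proposed domains and their relevant UMLS semantic types.
DOMAIN_STS = {
    "Demographic": ["Organism Attribute", "Organism Function"],
    "Medication": ["Pharmacologic Substance", "Clinical Drug",
                   "Organic Chemical"],
    "Laboratory": ["Laboratory or Test Result", "Laboratory Procedure"],
    "Disorder": ["Disease or Syndrome",
                 "Mental or Behavioral Dysfunction",
                 "Pathologic Function"],
    "Smoking Status": ["Environmental Effect of Humans"],
    "Clinical Observation": ["Clinical Attribute"],
}

# Inverted lookup: semantic type -> domain.
ST_TO_DOMAIN = {st: domain
                for domain, sts in DOMAIN_STS.items()
                for st in sts}


def categorize(primary_st):
    """Return the domain for the ST of a variable's primary concept."""
    return ST_TO_DOMAIN.get(primary_st, "Other categories")


print(categorize("Disease or Syndrome"))
```

A qualifier such as “past” has its own semantic type, but only the primary concept's type is passed to the lookup.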

4. Results

4.1. PGRN Data Dictionaries

A total of 1514 variables were collected from four PGRN sites. Following manual review, a number of variables were found either to be highly specific, and therefore less likely to be reused, or to be used repeatedly across dictionaries from the same site. As shown in Table 4, 84 variables were classified as site-specific, many of which represented processing state or internal flags, e.g., “uploaded to database”, “Field for Skip logic”. A total of 65 variables were found to differ by only a time-based qualifier, e.g., blood pressure at visit 1, 2, or 3, and 514 variables were repeated across dictionaries. The latter category included instances of variables that were repeated to create a list, e.g., Drug 1 name, Drug 2 name, etc., and those that were identical copies in different dictionaries, thereby representing instances of variable reuse.

Table 4.

Number of special variables collected from PGRN sites

Type of Variable	Descriptions	PGRN GROUP 1	PGRN GROUP 2	PGRN GROUP 3	PGRN GROUP 4	Total
Site-Specific	Variables designed for internal use with site specific flags	74	8	1	1	84
Differ only by Time Qualifier	Variables designed for recording different results retrieved for one particular event (diagnosis, laboratory test, etc) at different time points	5	1	59	0	65
Repeated	Variables with same semantic meanings	451	49	14	0	514
Unique	Variables with different semantic meaning	317	409	107	18	851
Total		847	467	181	19	1,514

4.2 Decomposing and Normalizing

Since variable names tend to be highly abbreviated and rarely capture the full semantics of the data that they represent, variable descriptions were chosen as a source for semantic annotation. To accomplish this, variable descriptions were decomposed into single words and short phrases, normalized, and then used as query terms to search controlled terminologies.

A total of 16,914 MCs were generated for the 1514 variables used in this study (Table 5). As described above, stop words and phrases that contained at least 50% stop words were removed prior to executing the query. This step reduced the number of MCs by 3970. In addition, two Specialist Lexicon tables, LRAGR and LRNOM, were used for speech and tense conversion; consequently, 868 MCs were converted to base forms.

Table 5.

Decomposing and normalizing results

	PGRN GROUP 1	PGRN GROUP 2	PGRN GROUP 3	PGRN GROUP 4	Total
Total number of MCs	7,389	3,827	1,857	54	13,127
Total number of MCs removed by stop words scanning	1,203	2,247	520	0	3,970
Total number of MCs converted by specialist lexicon	417	348	102	1	868

4.2. Semantic annotation

Annotated by controlled terminologies

The MCs generated in the steps above were annotated using the controlled terminologies described in Section 3.3. Annotation results were generated by invoking the NCBO BioPortal RESTful API and were returned in XML format; the numbers of mappings are summarized in Table 6. All results were reviewed manually to determine the most appropriate terms.

Table 6.

Semantic annotation results

		PGRN GROUP 1	PGRN GROUP 2	PGRN GROUP 3	PGRN GROUP 4	Total
Total number of MCs		7,389	3,827	1,857	54	13,127
Total number of mappings from five terminologies		48,509	20,652	9,683	673	79,517
Total number of mappings	LOINC	12,308	4,852	2,409	158	19,727
	NCI Thesaurus	10,719	4,801	2,328	152	18,000
	NDF-RT	6,244	2,813	1,315	106	10,478
	RxNorm	6,758	2,838	1,083	105	10,784
	SNOMED-CT	12,480	5,348	2,728	152	20,708

Annotation review

Annotation results were reviewed using the web application shown in Figure 2. For example, three MCs (“History”, “Hypercholesterolemia”, and “History of Hypercholesterolemia”) were generated for the variable description “History of Hypercholesterolemia”. Each of these MCs was used as a query term for searching the five aforementioned terminologies, and the results were reviewed by a curator. Term selection was based on both the term definition and the ST of the term. In this example, “Hypercholesterolemia” is a disease, so candidate terms that had a non-disease ST were excluded from consideration, and “personal medical history”, with “Clinical Attribute” as its ST, was selected as the best term to represent the concept of “history”. Terminology preference for specific domains was also considered as a determining factor when a given concept had mappings to multiple terminologies. Specifically, SNOMED-CT was preferred for representing concepts related to disease, RxNorm for concepts related to medications, and LOINC for concepts related to laboratory tests. Finally, in the example shown in Figure 2, “History of Hypercholesterolemia” was marked as a “complete mapping”, since the semantic meaning of the variable was completely captured using the selected terms.

Two observations became evident during the annotation step. First, variables in this study were, in general, highly pre-coordinated and therefore required several concepts to capture their semantic meaning. For example, it is common to record a subject’s race in pharmacogenomics studies, since allele frequencies can vary widely among different racial groups. Furthermore, in family studies, it is common to record not only the primary subject’s race, but also the race of family members. The data dictionaries used for this study included several variables that captured the race of different individuals. These variables were semantically identical at the level of both the variable description (“race category”) and its permissible values (e.g., “American Indian or Alaska Native”, “Asian”, etc.; see Table 1), but differed from each other in the term that represented the relationship of the individual in question to the primary subject. In these cases, it is preferable to use a generic variable to represent the primary concept and its set of permissible values, or value domain, and then add a qualifier to capture the distinguishing factor. While this may be difficult to achieve on a case report form or family history questionnaire, it is relevant to the data models that are used to represent the information.
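One way to picture this generic-variable-plus-qualifier pattern in a data model is sketched below. The `Variable` class and its field names are hypothetical and are not taken from the PGRN dictionaries; the value domain reuses the OMB race categories from Table 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# OMB minimum race designations, as listed in Table 1.
OMB_RACE = (
    "American Indian or Alaska Native",
    "Asian",
    "Black or African American",
    "Native Hawaiian or Pacific Islander",
    "White",
)


@dataclass(frozen=True)
class Variable:
    concept: str                        # primary concept
    value_domain: Tuple[str, ...]       # permissible values
    relationship: Optional[str] = None  # qualifier; None = primary subject


subject_race = Variable("race category", OMB_RACE)
grandmother_race = Variable("race category", OMB_RACE,
                            relationship="maternal grandmother")

print(subject_race.concept == grandmother_race.concept)
```

The two variables share one concept and one value domain, so their data can be compared directly; only the qualifier tells them apart.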

Second, many variables were captured as derived values (e.g., the age of the subject at diagnosis or at hospitalization) rather than as primary data (e.g., birth date, date of diagnosis, and date of hospitalization). While it is convenient to capture derived values that are relevant for a particular study, doing so makes it more difficult to use the data set for secondary purposes. Capturing data as primary values simplifies data integration and reuse.
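The asymmetry is easy to see in code: a derived value such as age at diagnosis can always be recomputed from primary dates, whereas the dates cannot be recovered from the age. The sketch below uses a hypothetical helper `age_at` and made-up dates.

```python
from datetime import date


def age_at(birth_date, event_date):
    """Whole years elapsed between birth and an event."""
    years = event_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred that year.
    if (event_date.month, event_date.day) < (birth_date.month,
                                             birth_date.day):
        years -= 1
    return years


# Illustrative dates only; they are not taken from any PGRN data set.
birth = date(1950, 6, 15)
diagnosis = date(2010, 6, 14)
hospitalization = date(2011, 1, 3)

print(age_at(birth, diagnosis), age_at(birth, hospitalization))
```

With the three primary dates stored, any study-specific age can be derived on demand, including ones the original study never anticipated.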

4.3 Categorization

Only variables with the “complete mapping” label were moved into the categorization step. We used the selected annotation results with ST information and relied on human domain knowledge to assign variables to categories. The categories are shown in Table 7. Note that the counts in Table 7 are based only on the 797 “unique” variables (see Table 4) that received a “complete mapping”.

Table 7.

Categorization results and examples for 797 variables from four PGRN groups

Category	# variables	Variable	MC	Preferred Name	Concept Code	Terminology	ST
Medication	170	Drug Strength	Drug	Substance	C459	NCI Thesaurus	Pharmacologic Substance
		Drug Strength	Strength	Pharmaceutical Strength	C53294	NCI Thesaurus	Qualitative Concept
		currently taking aspirin	aspirin	aspirin	1191	RxNorm	Organic Chemical
		currently taking aspirin	currently	Current	15240007	SNOMED CT	Temporal Concept
Disease Disorder	146	Lone atrial fibrillation	Lone atrial fibrillation	Lone atrial fibrillation	233910005	SNOMED CT	Disease or Syndrome
		History of Myocardial Infarction	Myocardial Infarction	Myocardial Infarction	22298006	SNOMED CT	Disease or Syndrome
		History of Myocardial Infarction	History	Personal Medical History	C18772	NCI Thesaurus	Clinical Attribute
Clinical Observation	71	Clinic diastolic blood pressure	Clinic	Clinic	C51282	NCI Thesaurus	Health Care Related Organization
		Clinic diastolic blood pressure	Diastolic blood pressure	Diastolic blood pressure	271650006	SNOMED CT	Clinical Attribute
Laboratory	69	electrophysiology study	electrophysiology	electrophysiology	LP6252-3	LOINC	Laboratory Procedure
Smoking Status	65	What age quit smoking	Age	Age	LP28815-6	LOINC	Organism Attribute
		What age quit smoking	smoking	Tobacco Smoking	C17934	NCI Thesaurus	Individual Behavior
		What age quit smoking	stop	Stop	C65125	NCI Thesaurus	Activity
Demographics	62	Age	Age	Age	LP28815-6	LOINC	Organism Attribute
		Gender	Gender	Gender	LP61312-2	LOINC	Organism Attribute
Other categories	214	DNA Sample Number	DNA	DNA	LP32416-7	LOINC	Nucleic Acid, Nucleoside, or Nucleotide
		DNA Sample Number	sample	Specimen	C19157	NCI Thesaurus	Physical Object
		DNA Sample Number	Number	Number	C25337	NCI Thesaurus	Quantitative Concept

It is not surprising that pharmacogenomics data sets contain a relatively large number of variables that represent medications, diseases, clinical observations, laboratory values, and demographics. However, it should be noted that many laboratory-based variables, such as “gamma-glutamyl hydrolase activity in diagnostic bone marrows” and “R enantiomer of the primary metabolite Desmethyl Citalopram (ng/mL)”, could not be fully annotated, and therefore could not be categorized, because there was no suitable term (e.g., a LOINC code) to represent them. This may be due to the fact that some laboratory tests that are used in pharmacogenomics studies are conducted in experimental, rather than clinical, labs. As pharmacogenomics data is integrated into clinical practice, it may be necessary to extend terminologies to represent new laboratory tests.

It was also striking that none of the pharmacogenomics data dictionaries used in this study contained variables that represented genomic data. Obviously, the research sites that provided the dictionaries generate and store genomic data. The absence of these elements in their data dictionaries may be a reflection of the relative immaturity of the application of pharmacogenomics data in a clinical setting and a tendency to consider the genomic data experimental. The lack of standards to represent pharmacogenomics data may also be a factor. Clearly, this is an area for future work.

4.4 Evaluation

Domain experts at Mayo Clinic were invited to review our semantic annotation work, including the annotation selections and categorization outcomes. Based on their evaluation results, we performed two further evaluations to determine the overall performance of our harmonization infrastructure. Evaluations by PGRN sites have not yet been done, but will take place in the coming months.

Semantic annotation

In this evaluation step, we considered annotation results only for the “unique” variables, excluding duplicated and repeated ones. Table 8 shows that 93.6% of the PGRN variables in this study were fully captured by the annotation results selected by curators. The number of complete annotations could be increased by performing additional modifications for the variables with partial or no mapping.

Table 8.

Semantic annotation results

PGRN Groups	# variables	# “complete mapping”	# “partial mapping”	# “no mapping”
PGRN GROUP 1	317	295	11	11
PGRN GROUP 2	409	387	17	5
PGRN GROUP 3	107	97	4	6
PGRN GROUP 4	18	18	0	0
Total #	851	797 (93.6%)	32 (3.8%)	22 (2.6%)

Categorization with Semantic Types

A total of 583 variables were grouped into six categories based on semantic types and domain knowledge. The numbers of matched variables, along with percentages, are shown in Table 9. Of these, 509 variables (87.3%) were successfully grouped into appropriate categories by ST, and 74 variables (12.7%) were not placed in any relevant category by ST. The main reason for this 12.7% failure rate is that the primary word was missing from the variable description, so no corresponding ST was assigned; for example, “dose” and “Dosing frequency” are missing “drug” as the primary word. In such cases, we manually moved the variables into the correct groups.

Table 9.

Categorization results with Semantic Types

	Demographics	Medication	Laboratory	Disease Disorder	Clinical Observation	Smoking Status
PGRN GROUP 1	9(69.2%)	85(100%)	31(72.1%)	28(87.5%)	47(83.9%)	5(83.3%)
PGRN GROUP 2	23(67.6%)	45(84.9%)	11(57.9%)	85(96.6%)	4(75%)	55(93.2%)
PGRN GROUP 3	6(54.5%)	24(100%)	4(80%)	22(100%)	8(100%)	0(100%)
PGRN GROUP 4	3(75%)	8(100%)	2(100%)	4(100%)	0(100%)	0(100%)
Total	41(66%)	162(95.3%)	48(69.6%)	139(95.2%)	59(83.1%)	60(92.3%)
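The percentages in the “Total” row can be reproduced from the category sizes in Table 7 and the ST-grouped counts in Table 9. The short check below is only a worked arithmetic example; the helper name `percent` is hypothetical.

```python
# Category sizes from Table 7 (excluding "Other categories").
CATEGORY_TOTALS = {
    "Demographics": 62, "Medication": 170, "Laboratory": 69,
    "Disease Disorder": 146, "Clinical Observation": 71,
    "Smoking Status": 65,
}
# Variables grouped by ST, from the "Total" row of Table 9.
GROUPED_BY_ST = {
    "Demographics": 41, "Medication": 162, "Laboratory": 48,
    "Disease Disorder": 139, "Clinical Observation": 59,
    "Smoking Status": 60,
}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rates = {name: percent(GROUPED_BY_ST[name], total)
         for name, total in CATEGORY_TOTALS.items()}
overall = percent(sum(GROUPED_BY_ST.values()),
                  sum(CATEGORY_TOTALS.values()))

print(rates)
print(overall)
```

The six category sizes sum to 583 and the grouped counts to 509, which gives the overall 87.3% reported in the text.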

5. Limitations and Future Work

Variables from PGRN sites were not completely distinguished from value sets; that is, some entries recorded as variables were actually permissible values. For example, we had “subject race” and “American Indian or Alaskan Native” as individual variables, whereas the second should belong to the value set of the first. In this work, we did not differentiate these entries or process them separately. In future work we will extract value sets from the mixed data sets, combine them with the permissible values provided separately by PGRN sites, and then standardize and load them into LexEVS [17] for future browsing and querying.

We aggregated and processed data from four PGRN groups and generated six common categories in this work. The workflow reported in this paper will be used to handle data sets from more PGRN sites, and more categories will undoubtedly be derived from the particular research focuses of these sites. Site-specific variables will also be taken into account in future work.

Because clinical data make up the large majority of the PGRN data received to date, this study focused on processing clinical data related to laboratory tests, medications, and diseases. We did also collect some genomics sample data from particular PGRN groups, and we expect that more genomics data descriptors will be placed into our PGRN data repository in the near future. We will then collaborate with a joint Genomics Work Group established by HL7 [18] and CDISC [19] to address problems associated with genomics data harmonization and to generate PGRN-specific genomics data standards.

To fill the gap between pharmacogenomics data standardization, linkage to the Electronic Medical Record (EMR), and clinical research standards, further mappings to standardized clinical data models will be considered for each category. We propose to map PGRN variables from each category to the Clinical Element Model [20], CDISC [19], Case Report Forms from caDSR [21], and PhenX [22]. This future work will not only allow PGRN variables to be represented in a more standardized way, but also provide flexibility for bridging and extending PGRN-specific variables to the clinical data models.

6. Conclusion

Data and metadata standards help to mitigate problems that arise from semantic and syntactic differences between research groups. These differences are major barriers that hinder effective communication among scientists and slow the pace of advancement and discovery. It is often difficult for those in rapidly advancing fields of study to converge on a set of standards before a significant volume of data is generated. This can result in large data sets that are difficult to interpret, merge, and use in downstream analyses that were not part of the original study design. This work describes an initial effort to harmonize data dictionaries from pharmacogenomics research sites. Our results demonstrate that there is a significant amount of variability in how data is represented among PGRN sites and that a larger standardization effort is needed.

Highlights.

We demonstrate harmonization/semantic annotation work for PGRN data dictionaries
These results are a critical first step toward data standardization for PGx studies
Our approach avoids proliferation of proposed standards by using controlled terminologies
A semi-automated system was developed to support the harmonization/annotation process

Acknowledgments

This work was supported by the NIH/NIGMS (U19 GM61388; the Pharmacogenomic Research Network).

Contributor Information

Qian Zhu, Email: zhu.qian@mayo.edu, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Robert R. Freimuth, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Zonghui Lian, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Scott Bauer, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Jyotishman Pathak, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Cui Tao, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Matthew J. Durski, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Christopher G. Chute, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

References

1. PGRN. http://pgrn.org/display/pgrnwebsite/PGRN+Home [accessed July 2012].
2. Long RM, Berg JM. What to expect from the pharmacogenomics research network. Clin Pharmacol Ther. 2011;89:339–41. doi: 10.1038/clpt.2010.293.
3. O’Donnell PH, Ratain MJ. Germline pharmacogenomics in oncology: Decoding the patient for targeting therapy. Molecular oncology. 2012. doi: 10.1016/j.molonc.2012.01.005.
4. Abo R, Hebbring S, Ji Y, Zhu H, Zeng ZB, Batzler A, Jenkins GD, Biernacka J, Snyder K, Drews M, Fiehn O, Fridley B, Schaid D, Kamatani N, Nakamura Y, Kubo M, Mushiroda T, Kaddurah-Daouk R, Mrazek DA, Weinshilboum RM. Merging pharmacometabolomics with pharmacogenomics using ‘1000 Genomes’ single-nucleotide polymorphism imputation: selective serotonin reuptake inhibitor response pharmacogenomics. Pharmacogenetics and genomics. 2012. doi: 10.1097/FPC.0b013e32835001c9.
5. The OMB defines race and ethnicity as two categories, and prefers that this data be collected separately. The OMB does allow a combined format, however, that combines the permissible values from each category. http://www.census.gov/population/www/socdemo/race/Ombdir15.html [accessed July 2012].
6. Stop words. http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/ [accessed July 2012].
7. Common English words. http://www.textfixer.com/resources/common-english-words.php [accessed July 2012].
8. UMLS Specialist Lexicon. http://www.nlm.nih.gov/pubs/factsheets/umlslex.html [accessed July 2012].
9. UMLS Semantic Types. http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html [accessed July 2012].
10. SNOMED CT. http://www.ihtsdo.org/snomed-ct/ [accessed July 2012].
11. Brown SH, Elkin PL, Rosenbloom ST, Husser C, Bauer BA, Lincoln MJ, et al. VA national drug file reference terminology: a cross-institutional content coverage study. Medinfo. 2004:477–481.
12. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of biomedical informatics. 2007;40(1):30–43. doi: 10.1016/j.jbi.2006.02.013.
13. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT Professional. 2005;7(5):17–23.
14. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33. doi: 10.1373/49.4.624.
15. Rubin DL, Moreira DA, Kanjamala PP, et al. BioPortal: A Web Portal to Biomedical Ontologies. Association for the Advancement of Artificial Intelligence; 2007.
16. NCBO REST API. www.bioontology.org/wiki/index.php/NCBO_REST_services [accessed July 2012].
17. LexEVS. https://wiki.nci.nih.gov/display/LexEVS/LexEVS [accessed July 2012].
18. HL7. www.hl7.org/ [accessed July 2012].
19. CDISC. www.cdisc.org [accessed July 2012].
20. Clinical Element Model. http://intermountainhealthcare.org/CEM/ [accessed July 2012].
21. caDSR CRF. https://formbuilder.nci.nih.gov/FormBuilder/ [accessed July 2012].
22. Stover PJ, Harlan WR, Hammond JA, et al. PhenX: a toolkit for interdisciplinary genetics research. Curr Opin Lipidol. 2010;21:136–140. doi: 10.1097/MOL.0b013e3283377395.
