Conversion of CPRD AURUM Data into the OMOP Common Data Model

Craig S Mayer

doi:10.1016/j.imu.2023.101407

Author manuscript; available in PMC: 2024 Nov 10.

Published in final edited form as: Inform Med Unlocked. 2023 Nov 10;43:101407. doi: 10.1016/j.imu.2023.101407

PMCID: PMC10688258 NIHMSID: NIHMS1945870 PMID: 38046363

Abstract

Introduction:

Efforts to standardize clinical data using Common Data Models (CDMs) have grown in recent years. Use of CDMs allows for quicker understanding of data structure and reuse of existing tools. One CDM is the Observational Medical Outcomes Partnership (OMOP) CDM. Clinical Practice Research Datalink (CPRD) is a data collection program collecting general practitioner data in the UK.

Objective:

Our objective was to convert a static copy of CPRD AURUM data into the OMOP CDM and run existing tools on the converted data.

Methods:

Two methods were used to convert each CPRD file into the OMOP CDM. The first was direct mapping, used for CPRD files that had comparable tables in the OMOP CDM. The original column names were changed to the OMOP equivalents and source values were converted to standardized OMOP concepts. Four CPRD files were directly mapped: Patient (to OMOP Person), Staff (to Provider), Drug Issue (to Drug Exposure) and Practice (to Care Site). The second method was indirect mapping, used for the CPRD Observation file: all source values were converted to standard concepts, and the domain of each data row's standard concept was used to assign the row to the proper OMOP table or column.

Results:

The OMOP CDM conversion populated 12 tables and 20,240,453,339 rows, with the largest table being the Measurement table (5,202,579,174 data rows). Mapping source values to OMOP standard concepts, we found 60.2% (46,413 of 77,149) of source concepts were also standard concepts. The Drug Exposure table had the fewest source values already in the standard form, as only 4.7% (1,433 of 30,194) of the source concepts were standard concepts. On a data retention level, only 2.00% of all data rows were excluded, as they did not have a clear fit in the developed CDM and could not stand alone without additional information that was not present.

Conclusion:

CPRD AURUM was successfully converted into the OMOP CDM with minimal data loss. Existing OHDSI tools were used with the converted data to show efficacy of the converted data. The existence of a standardized version of CPRD AURUM data vastly increases its reusability in future research due to increased understanding and tools available.

Keywords: Data Science, Clinical Informatics, Real World Data, Common Data Model

1. Introduction

The use of Common Data Models (CDMs) is an approach to harmonize and standardize collected clinical data and facilitate the ability to efficiently understand and analyze collected data[1,2]. The application of such methods enhances the reuse of preexisting tools[3–5]. Using a standardized framework also allows for the efficient and effective comparison of different datasets even if the datasets have multiple, differing terminologies or coding systems and originate from different contexts[6,7]. One such CDM is the Observational Medical Outcomes Partnership (OMOP) CDM developed and maintained by Observational Health Data Sciences and Informatics (OHDSI). Large scale datasets and studies, such as All of Us in the United States, are using the OMOP CDM to provide a standardized model in order to represent and share collected clinical data[8]. There are also other efforts to convert previously collected data into the OMOP CDM, such as a project converting data from an existing research program as well as another project converting collected claims data[9,10].

The Clinical Practice Research Datalink (CPRD) is a data collection initiative that collects de-identified patient level data from a network of UK based general practitioners (GPs). There are two types of CPRD data, Gold and AURUM[11–13]. The two datasets differ in that they contain data generated from different electronic health record (EHR) systems. Due to the software structure of the two EHR systems CPRD does not attempt to combine them. Previous work by Janssen converted the CPRD Gold data into the OMOP CDM[14,15].

Our project focuses on the conversion of CPRD AURUM into the OMOP CDM in order to conduct various analyses to characterize the converted data and assess data quality[16]. Additionally, the conversion into the OMOP CDM was done in order to standardize and harmonize the CPRD AURUM data for the comparison to disparate datasets converted to the OMOP CDM. This effort will lead to several clinical and metadata analyses across multiple datasets including our approved research study regarding COVID-19 analysis listed on the CPRD website[17].

2. Materials and Methods

2.1. CPRD AURUM

CPRD AURUM data captures demographic, diagnostic, drug, lab and referral information from select UK based GPs[11,13]. The data is regularly collected and updated with de-identified data made available to researchers upon acceptance of a study protocol. The extract contains several files including Observation, Drug Issue, Patient, Staff, Practice, Referral and Consultation. Due to the volume of data the extract is divided into 50 folders fully containing all information for approximately one million patients each. The data extraction includes coding dictionaries that contain definitions and mappings for medical codes, drug product codes, among others and in some cases provides the information in a previously existing terminology such as SNOMED CT[13]. Our conversion was conducted on a static full data extract with data ending in June 2021 but can be adapted to later extractions given conformity of the source material in the same format with identical column and file names. This extract was provided with the sole purpose of conducting this data conversion, with additional permissions required to conduct any follow-up research.

2.2. Overview of OMOP CDM

2.2.1. OMOP concepts

We converted the CPRD AURUM data into the OMOP CDM version 5.4. The first phase (Phase I as shown in Figure 1 below) of the conversion was to assign native terms to OMOP concepts. The OMOP CDM uses a controlled terminology repository to organize native terms into OMOP concepts[18]. This mapping consists of concepts that are assigned concept ids for preexisting vocabularies[19]. For example, the condition ‘Lower respiratory tract infection’ (SNOMED CT code: 50417007) is assigned an OMOP concept id of 4175297. This information is stored in the OHDSI vocabulary repository known as Athena[20]. Athena can be used to manually look up specific concepts or to download the requisite tables storing the concepts and used terminologies. Each concept belongs to a single vocabulary (or terminology), with each vocabulary identified by a string identifier called the vocabulary id. For example, the vocabulary id of the SNOMED CT terminology is ‘SNOMED’. Information about these concepts is stored in the OMOP Concept table.
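The Phase I lookup described above can be sketched as a match on concept code and vocabulary id. This is a minimal illustration in Python, not the authors' R implementation; the in-memory list stands in for the OMOP Concept table and holds only the single row quoted in the text, and the function name is hypothetical.

```python
# Toy stand-in for the OMOP Concept table: only the row quoted in the text.
CONCEPT = [
    {
        "concept_id": 4175297,
        "concept_code": "50417007",
        "vocabulary_id": "SNOMED",
        "concept_name": "Lower respiratory tract infection",
    },
]


def source_concept_id(code, vocabulary_id, concept_rows=CONCEPT):
    """Return the concept id for (code, vocabulary id), or None if absent."""
    for row in concept_rows:
        if row["concept_code"] == code and row["vocabulary_id"] == vocabulary_id:
            return row["concept_id"]
    return None


lrti_id = source_concept_id("50417007", "SNOMED")
```

Matching on both the code and the vocabulary id matters because the same code string can occur in more than one vocabulary.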

2.2.2. Concept mapping and the Concept Relationship table

OMOP acknowledges that there are various terminologies used in different contexts and therefore defines a terminology or terminologies for certain data domains (e.g., diagnostic history or medication history) to be implemented for the standardization of the used concepts[21]. Per OMOP model specifications, an OMOP concept can be either standard or non-standard. Standard concepts are the declared concepts that are to be used to represent unique clinical entities in the standardized OMOP clinical data tables[22]. The second conversion phase (Phase II as shown in Figure 1 below) was to convert non-standard concepts to OMOP standard concepts. The OMOP CDM allows for the mapping (standardization) of non-standard concepts to standard concepts by way of the Concept Relationship table and the relationship id of ‘Maps to’. The Concept Relationship table includes relationships between OMOP concepts including a non-standard to standard relationship. For example, for the drug Amoxicillin, the non-standard source concept id of 36122473 (DM+D code: 39732311000001100) maps to the standard concept id 19073183 (RxNorm: 308182).
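As a rough sketch of Phase II, the standardization step amounts to filtering the Concept Relationship table on the ‘Maps to’ relationship id. The tuple list below is a toy stand-in containing only the Amoxicillin pair from the text, and the helper name is an assumption made for illustration. A list is returned because a source concept may carry more than one ‘Maps to’ relationship.

```python
# Toy stand-in for the Concept Relationship table:
# (concept_id_1, relationship_id, concept_id_2)
CONCEPT_RELATIONSHIP = [
    (36122473, "Maps to", 19073183),  # Amoxicillin pair quoted in the text
]


def to_standard(source_id, relationships=CONCEPT_RELATIONSHIP):
    """Return all standard concept ids that a source concept 'Maps to'."""
    return [
        target
        for origin, relationship_id, target in relationships
        if origin == source_id and relationship_id == "Maps to"
    ]


amoxicillin_standard = to_standard(36122473)
unmapped = to_standard(1)  # no 'Maps to' row, so the result is empty
```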

2.2.3. OMOP Domain IDs

The third conversion phase (Phase III as shown in Figure 1) was the assignment of OMOP standard concepts to the appropriate OMOP table and column. Each OMOP concept is also assigned a domain id in the Concept table. The domain id states what domain (in some cases table or column) the concept belongs to, such as Condition or Measurement. Figure 1 depicts the conversion process and the three phases outlined above.

2.3. Direct table conversion and structure

For select CPRD AURUM files there were direct equivalent tables in the OMOP CDM version 5.4. For these tables we took the columns from the CPRD version of the table and directly converted the naming convention to the OMOP CDM equivalent to harmonize the data structure and fields to the OMOP CDM. This was true of four OMOP tables including the Person table (from Patient), Provider (from Staff), Care Site (from Practice) and Drug Exposure (from Drug Issue). Table 1 shows the column conversions for each of these tables.

Table 1.

Column conversion for direct table mapping.

CPRD Table	CPRD Column	OMOP Table	OMOP Column
Patient	patid	Person	person_id
Patient	yob	Person	year_of_birth
Patient	mob	Person	month_of_birth
Patient	gender	Person	gender_source_value
Staff	StaffID	Provider	provider_id
Staff	Jobcatid	Provider	specialty_source_value
Practice	pracid	Care Site	care_site_id
Practice	region	Care Site	location_id
Drug Issue	issueid	Drug Exposure	drug_exposure_id
Drug Issue	issuedate	Drug Exposure	drug_exposure_start_date
Drug Issue	prodcodeid	Drug Exposure	drug_source_value
Drug Issue	quantity	Drug Exposure	quantity
Drug Issue	quantunitid	Drug Exposure	dose_unit_source_value
Drug Issue	duration	Drug Exposure	days_supply


Note that for conciseness Table 1 excludes columns that appear in multiple tables listed. For example, person_id would be listed in both the Person and Drug Exposure tables with the same column conversion.
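The direct conversion in Table 1 is essentially a column rename. A minimal sketch for the Patient rows follows, with the mapping copied from Table 1; the sample record and the function name are hypothetical, and columns without an entry in the mapping are simply left out of the renamed row.

```python
# Column mapping for Patient -> Person, copied from Table 1.
PATIENT_TO_PERSON = {
    "patid": "person_id",
    "yob": "year_of_birth",
    "mob": "month_of_birth",
    "gender": "gender_source_value",
}


def rename_columns(row, mapping):
    """Rename the keys of one record; keys without a mapping are dropped."""
    return {mapping[key]: value for key, value in row.items() if key in mapping}


# Hypothetical CPRD Patient record, for illustration only.
patient = {"patid": 101, "yob": 1980, "mob": 5, "gender": 1, "cprd_ddate": None}
person = rename_columns(patient, PATIENT_TO_PERSON)
```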

Some columns listed in Table 1 that denote a source value were mapped to the concept table to get the source concept id (Phase I outlined above) and then mapped to the standardized version to generate the standardized concept id (Phase II). For example, for the drug ‘Bendroflumethiazide 2.5 MG Oral Tablet’ the source value of 317919004 (DM+D code) maps to the source concept id of 21199966 which then maps to the standard concept id of 19073982.

Although these tables were a direct match, not all columns fit perfectly into each table. In the case of the CPRD Patient table the column pertaining to death date (cprd_ddate) belongs in the OMOP Death table rather than the Person table.

Also, in the case of the OMOP Drug Exposure table and the CPRD Drug Issue table, the table names were not an exact match. The OMOP Drug Exposure table pertains to a patient who was exposed to any drug at any point, whereas the CPRD Drug Issue table pertains to medical issues derived from the use of such drugs; this difference should be considered when interpreting the data provided.

2.4. Manual mapping of gender

Since the gender source values did not originate from a formalized terminology and were custom coding in the source dataset, manual mapping was needed. In the case of gender source value, due to the small number of options (two distinct source values), the source values were manually mapped to the standardized concept id without finding a source concept id. The mapping was done by performing a manual lookup through Athena. For example, for gender source value 1 which equals ‘Male’, the value was directly mapped to the standardized concept of 8507.
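A manual mapping of this size can be held in a small lookup table. In the sketch below only the first entry (1 to 8507, ‘Male’) comes from the text. The second entry is an assumption: that the other source value is 2 and denotes ‘Female’, mapped to the OMOP FEMALE concept 8532. Returning 0 for unrecognized values follows the OMOP convention for unmapped concepts.

```python
GENDER_SOURCE_TO_CONCEPT = {
    "1": 8507,  # 'Male', as stated in the text
    "2": 8532,  # ASSUMPTION: source value 2 means 'Female' (OMOP concept 8532)
}


def gender_concept_id(source_value):
    """Map a gender source value to a standard concept id (0 if unmapped)."""
    return GENDER_SOURCE_TO_CONCEPT.get(str(source_value).strip(), 0)


male_id = gender_concept_id(1)
unknown_id = gender_concept_id("9")
```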

2.5. Provider specialty mapping

Within the CPRD Observation table (discussed later) certain data rows included information regarding the specialty of the provider referenced in the data row. We used this information to obtain specialty source values that were in a controlled terminology. This was done as an alternative to the manual mapping of each specialty source value (here Jobcatid) listed in the CPRD Staff table, though manual review was done to ensure proper semantic equivalence. The specialty source value from the controlled terminology was then used to find the specialty source concept id and then concept id via the Concept and Concept Relationship table as described above.

2.6. Indirect table conversion and value mapping

The CPRD Observation file contains information regarding several different OMOP tables such as Measurement, Observation, etc. For each data row in the CPRD Observation file we converted the CPRD medcodeID to the standardized OMOP concept id (through the process shown in Figure 1). To do this we first joined the medcodeID to the provided CPRD medcode dictionary containing the map of the medcodeID to both SNOMED CT and Read codes. Since SNOMED CT is commonly the standard for OMOP we used the SNOMED CT code as the source value. We then joined the Concept table on the SNOMED CT code (source value) and concept code listed in the Concept table to get the equivalent OMOP concept id (source concept id) for the stated SNOMED CT value (Phase I as stated in the Overview of OMOP CDM section and depicted in Figure 1). Then using the Concept Relationship table and the ‘Maps to’ relationship id we mapped the concept id for the original SNOMED CT code to the standard version (concept id) (Phase II). Table 2 shows examples of this mapping.

Table 2.

Example source to standard concept mapping (Phase II).

Source Value	Description	Source Concept ID	Concept ID	Domain
21522001	Abdominal pain	200219	200219	Condition
54150009	Upper respiratory infection	4181583	4181583	Condition
329653008	ibuprofen 400 MG Oral Tablet	21293036	19019072	Drug
323416001	penicillin V potassium 250 MG Oral Tablet	21199988	19133873	Drug
322236009	acetaminophen 500 MG Oral Tablet	21311718	19020053	Drug
1022651000000100	Platelet count	37393863	37393863	Measurement
1022451000000103	Red blood cell count	37393849	37393849	Measurement
1022551000000104	Neutrophil count	37393856	37393856	Measurement
160573003	Alcohol intake	4052351	4052351	Observation
91930004	Allergy to egg protein	442116	4020878	Observation
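The chain of joins described above (medcode dictionary, then Concept, then Concept Relationship) can be sketched with an in-memory SQLite database. The concept values come from two rows of Table 2. The medcode ids `m1` and `m2`, the `medcode_dict` table and its column names are placeholders, and the self-referencing ‘Maps to’ row for the already-standard concept reflects how the OMOP vocabulary represents standard concepts. The domain of source concept 442116 is not given in Table 2, so it is left empty.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE medcode_dict (medcodeid TEXT, snomed_code TEXT);
CREATE TABLE concept (concept_id INTEGER, concept_code TEXT,
                      vocabulary_id TEXT, domain_id TEXT);
CREATE TABLE concept_relationship (concept_id_1 INTEGER,
                                   relationship_id TEXT,
                                   concept_id_2 INTEGER);
""")
# Hypothetical medcode ids pointing at two SNOMED CT codes from Table 2.
con.executemany("INSERT INTO medcode_dict VALUES (?, ?)",
                [("m1", "91930004"), ("m2", "21522001")])
con.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (442116, "91930004", "SNOMED", None),   # source concept, domain not stated
    (4020878, None, None, "Observation"),   # standard concept for egg allergy
    (200219, "21522001", "SNOMED", "Condition"),  # already standard
])
con.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (442116, "Maps to", 4020878),
    (200219, "Maps to", 200219),  # standard concepts map to themselves
])

mapped = con.execute("""
    SELECT d.medcodeid, src.concept_id, std.concept_id, std.domain_id
    FROM medcode_dict d
    JOIN concept src
      ON src.concept_code = d.snomed_code AND src.vocabulary_id = 'SNOMED'
    JOIN concept_relationship cr
      ON cr.concept_id_1 = src.concept_id AND cr.relationship_id = 'Maps to'
    JOIN concept std ON std.concept_id = cr.concept_id_2
    ORDER BY d.medcodeid
""").fetchall()
```

Because these are inner joins, a medcode with no SNOMED CT concept or no ‘Maps to’ row drops out of the result; a conversion that needs to count such rows would use left joins instead.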


To determine which OMOP table each data row in the CPRD Observation table would be assigned to, we joined the OMOP Concept table on the mapped standardized concept id. This gave us the domain id for the standardized concept allowing for the proper assignment of each data row in the CPRD Observation table to one of the six OMOP tables found in the CPRD Observation table (Phase III) as well as any additional OMOP columns specified such as data rows pertaining to the Type domain. The OMOP domains pertaining directly to OMOP tables included Measurement, Observation, Condition Occurrence, Procedure Occurrence, Device Exposure and Specimen. Also included in the CPRD Observation table were concepts with a domain of Drug, and thus would fit into the Drug Exposure table along with the previously mentioned direct conversion of the CPRD Drug Issue table.

We used the domain for the standardized concept to ensure the most accurate assignment and equivalence with other OMOP CDM mapped datasets, although it is possible that the standard concept may belong to a different domain than the source concept. For example, the SARS-CoV2 vaccination has a domain of Procedure for the source concept id (OMOP concept id: 3548104) while the standard concept has a domain of Drug (OMOP concept id: 724904). In this instance the data row would be assigned to the Drug Exposure table in line with the domain of the standardized concept.
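The routing rule above can be sketched as a lookup keyed on the domain of the standard concept, with the source domain deliberately ignored. The lower-case table names follow the OMOP CDM naming; the function name and the decision to return None for domains that need manual handling are assumptions of this sketch.

```python
# Domains that correspond directly to an OMOP clinical table.
DOMAIN_TO_TABLE = {
    "Measurement": "measurement",
    "Observation": "observation",
    "Condition": "condition_occurrence",
    "Procedure": "procedure_occurrence",
    "Device": "device_exposure",
    "Specimen": "specimen",
    "Drug": "drug_exposure",
}


def target_table(source_domain, standard_domain):
    """Pick the OMOP table from the standard concept's domain only.

    source_domain is accepted to make the choice explicit but is not used.
    Returns None for domains (e.g. Type, Unit) that need separate handling.
    """
    return DOMAIN_TO_TABLE.get(standard_domain)


# Vaccination example from the text: source is Procedure, standard is Drug.
vaccine_table = target_table("Procedure", "Drug")
```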

Certain data rows had standard concepts from OMOP domains that did not match an OMOP table by name (such as type, metadata, etc.). For these we manually determined if and where it made sense to place the present values based on the stated domain. For the data rows with Type concepts, since these are not specific to a given OMOP table, we joined the information to data rows in the OMOP tables based on person id, provider id, and date. This gave the most accurate assessment of what the data rows stating the type were referring to. For other domains that are not able to stand alone (such as Unit and Route), we similarly joined this information to data rows in the appropriate table using the person id, provider id and date.
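One way to picture the join described above is a lookup keyed on person id, provider id and date. The sketch below is illustrative only: the field names are hypothetical, and keeping the first Type row when several share a key is an assumption, since the text does not say how such ties were resolved.

```python
from datetime import date


def attach_types(clinical_rows, type_rows):
    """Attach a type concept id to clinical rows sharing person, provider, date."""
    index = {}
    for row in type_rows:
        key = (row["person_id"], row["provider_id"], row["date"])
        index.setdefault(key, row["type_concept_id"])  # ASSUMPTION: first wins
    enriched = []
    for row in clinical_rows:
        key = (row["person_id"], row["provider_id"], row["date"])
        enriched.append({**row, "type_concept_id": index.get(key)})
    return enriched


# Hypothetical rows; 32817 is the EHR type concept mentioned in the paper.
clinical = [
    {"person_id": 1, "provider_id": 10, "date": date(2020, 1, 5), "concept_id": 200219},
    {"person_id": 2, "provider_id": 10, "date": date(2020, 1, 5), "concept_id": 200219},
]
types = [
    {"person_id": 1, "provider_id": 10, "date": date(2020, 1, 5), "type_concept_id": 32817},
]
result = attach_types(clinical, types)
```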

2.7. Populating the Visit Occurrence and Observation Period tables

The OMOP Visit Occurrence table is the most inclusive of all tables and indicates every visit for an individual. We populated the Visit Occurrence table by creating a visit for every unique date and provider combination associated with a data row from the CPRD Observation table for an individual. This will likely exaggerate the number of visits by splitting multiple-day visits and single visits involving multiple providers. However, this issue is mitigated by the nature of the data, which is general practice data rather than hospital or inpatient care, where multiple-day and multiple-provider visits are more common. The Observation Period table was populated by assigning the observation period start date as the date of the first visit and the end date as the date of the last visit or the date of death.
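The visit and observation period rules can be sketched as a de-duplication followed by a per-person minimum and maximum. The field names are hypothetical. The text does not state how the choice between last visit and death date was made, so this sketch assumes the death date is used as the end date whenever one is recorded.

```python
from datetime import date


def build_visits(observation_rows):
    """One visit per unique (person, date, provider) combination."""
    keys = {(r["person_id"], r["obs_date"], r["provider_id"]) for r in observation_rows}
    return sorted(keys)


def observation_periods(visits, death_dates=None):
    """Map person id to (start, end): first visit to last visit or death."""
    death_dates = death_dates or {}
    periods = {}
    for person_id, visit_date, _provider_id in visits:
        start, end = periods.get(person_id, (visit_date, visit_date))
        periods[person_id] = (min(start, visit_date), max(end, visit_date))
    for person_id, died in death_dates.items():
        if person_id in periods:
            # ASSUMPTION: a recorded death date replaces the last visit date.
            periods[person_id] = (periods[person_id][0], died)
    return periods


rows = [
    {"person_id": 1, "obs_date": date(2020, 1, 5), "provider_id": 10},
    {"person_id": 1, "obs_date": date(2020, 1, 5), "provider_id": 10},  # same visit
    {"person_id": 1, "obs_date": date(2020, 1, 5), "provider_id": 11},
    {"person_id": 1, "obs_date": date(2021, 3, 1), "provider_id": 10},
]
visits = build_visits(rows)
periods = observation_periods(visits)
periods_with_death = observation_periods(visits, {1: date(2021, 4, 1)})
```

The third sample row shows the over-counting noted in the text: the same person seen by two providers on one day yields two visits.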

2.8. Analysis of data volume

We calculated the volume of data in each OMOP table to better understand the type of data contained in the more than 20 billion data rows provided. This included quantifying the overall number of data rows and the number of distinct source values and standardized concepts in each OMOP table.

We also reviewed the extent of the concepts successfully converted into the OMOP CDM. This included calculating the amount and volume of original codes that were already in the standardized form, as well as the codes that did not have an OMOP concept id or did not map to any standardized value. A vast amount of unconverted data would lead to limitations in the successful use of OHDSI tools and comparison to other OMOP CDM databases along with the inaccurate capturing of the original data.

2.9. ETL process

R software was used to conduct the ETL of the CPRD AURUM data into the OMOP CDM. Several R packages were used to load the previously described CPRD AURUM data files and transform them into the OMOP CDM described above. The resulting conversion was loaded and stored in a SQLite database using the DBI R package. The R script used for the conversion can be found in our project repository[23].
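The authors' ETL was written in R with the DBI package. Purely as an illustration of the load step, the sketch below reads a small tab-delimited extract and writes it to a SQLite table using Python's standard library. The sample text, its tab-delimited layout and the three-column `person` table are assumptions of this sketch and do not reproduce the authors' script or the full OMOP Person schema.

```python
import csv
import io
import sqlite3

# Hypothetical two-row extract standing in for a CPRD Patient file.
SAMPLE = "patid\tyob\tgender\n101\t1980\t1\n102\t1975\t2\n"


def load_person(con, text):
    """Parse tab-delimited text and insert the rows into a person table."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "person_id INTEGER, year_of_birth INTEGER, gender_source_value TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = [(int(r["patid"]), int(r["yob"]), r["gender"]) for r in reader]
    con.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)
    con.commit()
    return len(rows)


con = sqlite3.connect(":memory:")
loaded = load_person(con, SAMPLE)
```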

2.10. Use of OMOP converted data

To test the success and utility of the OMOP conversion, we used previously created tools to extract useful information about the dataset. This included OMOP related tools previously created by our team, achilles2 and Scyros, as well as community based OHDSI tools such as Atlas cohort definitions (links to these tools can be found in the references)[24–26]. Achilles2 is a modified version of the OHDSI Achilles program that characterizes many features of the database, with Scyros being an extension of Achilles and achilles2 to include new measures[24,25,27]. Results of achilles2 include the counts in the data volume section. Atlas is a program to develop OHDSI cohort definitions and queries[26]. We ran multiple Atlas cohorts as example usage and proof of utility of the converted data.

3. Results

3.1. Data volume

Based on the above conversion, Table 3 shows the number of data rows in each of the included OMOP tables. In total the original CPRD data included 20,661,009,203 data rows. Of all data rows, 20,240,453,339 (98.0%) were part of the tables listed in Table 3. The remaining 420,554,864 (2.0%) data rows were lost due to redundancy with already converted data, missing source concepts, a lack of a potential standardization mapping, or a lack of fit or utility in the OMOP CDM. On a concept level, Table 3 also shows the number of distinct standardized concept ids and source values.

Table 3.

Number of data rows, values and concepts in each OMOP table[16]

OMOP Table	Table Data Rows	Domain	Distinct Source Values	Distinct Standard Concepts
Measurement	5,202,579,174	Measurement	6,229	6,141
Measurement	5,202,579,174	Unit	117	110
Measurement	5,202,579,174	Measurement Value Operator	1	1
Drug Exposure	5,161,011,209	Drug	30,193	29,766
Drug Exposure	5,161,011,209	Route	3	3
Visit Occurrence	3,705,414,020	Visit	185	168
Observation	3,144,794,295	Observation	29,142	28,189
Condition Occurrence	1,923,463,583	Condition	42,377	40,488
Procedure Occurrence	1,048,783,300	Procedure	10,766	10,258
Person	49,102,289	Gender	2	2
Person	49,102,289	Race	366	233
Death	3,141,446	n/a	0	0
Provider	1,028,761	Specialty	234	302
Specimen	572,656	Specimen	243	241
Specimen	572,656	Specimen Anatomic Site	4,608	2,930
Specimen	572,656	Specimen Disease Status	3	3
Device Exposure	561,064	Device	814	805
Care Site	1,542	n/a	0	0
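As a quick arithmetic check of the totals quoted in the text, the per-table row counts from Table 3 can be summed and compared with the original row count. Each OMOP table is counted once even where Table 3 lists it under several domains; the variable names are illustrative.

```python
# Row counts per OMOP table, copied from Table 3 (each table counted once).
TABLE_ROWS = {
    "Measurement": 5_202_579_174,
    "Drug Exposure": 5_161_011_209,
    "Visit Occurrence": 3_705_414_020,
    "Observation": 3_144_794_295,
    "Condition Occurrence": 1_923_463_583,
    "Procedure Occurrence": 1_048_783_300,
    "Person": 49_102_289,
    "Death": 3_141_446,
    "Provider": 1_028_761,
    "Specimen": 572_656,
    "Device Exposure": 561_064,
    "Care Site": 1_542,
}

ORIGINAL_ROWS = 20_661_009_203  # total CPRD data rows stated in the text

converted_rows = sum(TABLE_ROWS.values())
retained_pct = round(100 * converted_rows / ORIGINAL_ROWS, 1)
```

The sum reproduces the 20,240,453,339 converted rows across 12 tables, and the retained share rounds to the reported 98.0%.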


Since all data originated from the same source, an EHR system, there is only a single concept in the Observation Period table for the period type concept id that reflects EHR data (OMOP concept id: 32817).

While the Drug Exposure table includes data rows that originate from both the CPRD Observation and Drug Issue tables, the vast majority are from the Drug Issue table. Of the 5,161,011,209 data rows in the Drug Exposure table, only 251,625 (0.005%) originate from the CPRD Observation table.

Due to the nature of the tables and available data, there are no source values and standard concept ids in the converted Death or Care Site tables in our version of the OMOP conversion. In the overall OMOP CDM there would be columns such as cause concept id in the Death table; however, this column was unpopulated in our converted data as the CPRD AURUM data does not contain information about cause of death. In the case of Type concepts, which are present in many of the tables, we found 1,202 source values and 903 standard concepts overall. While the purpose of standardizing concepts is not to reduce the number of distinct concepts, we found that for each table the number of standardized concepts is less than the number of source values. This is because in many cases multiple source values convert to the same standardized concept. Overall, there are 115,890 distinct standard concepts and 119,766 source values.

3.2. Data element mapping

Some of the source concepts in each of the tables were already the standard version. In fact, due to the use of SNOMED CT (commonly the standard terminology in OMOP), in most tables the vast majority of source concepts were also standard concepts and thus did not require mapping. Table 4 shows the counts and percentages of such elements in each table. Overall, we found 60.2% (46,413 of 77,149) of the source concepts in CPRD were in the standard form.

Table 4.

Count of standard and non-standard source concepts.

Table	Total Source Concepts	Non-standard Source Concepts	Standard Source Concepts	Percentage of Standard Source Concepts
Condition Occurrence	42,381	2,397	39,984	94.30%
Device Exposure	814	13	801	98.40%
Drug Exposure	30,194	28,761	1,433	4.70%
Measurement	6,233	126	6,107	98.00%
Observation	29,142	1,176	27,966	96.00%
Procedure Occurrence	10,766	660	10,106	93.90%
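The percentages in Table 4 follow directly from the two count columns. The short sketch below recomputes them from the table's own figures; the variable names are illustrative.

```python
# (total source concepts, standard source concepts), copied from Table 4.
TABLE_4 = {
    "Condition Occurrence": (42_381, 39_984),
    "Device Exposure": (814, 801),
    "Drug Exposure": (30_194, 1_433),
    "Measurement": (6_233, 6_107),
    "Observation": (29_142, 27_966),
    "Procedure Occurrence": (10_766, 10_106),
}

# Share of source concepts that are already standard, in percent.
standard_pct = {
    table: round(100 * standard / total, 1)
    for table, (total, standard) in TABLE_4.items()
}
# Non-standard count is the remainder.
non_standard = {
    table: total - standard for table, (total, standard) in TABLE_4.items()
}
```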


The one OMOP table that is different is the Drug Exposure table where the majority, 95.3% (28,761 of 30,194) of the source concepts are not standard and required mapping to a standard concept. This is seen with 30,192 DM+D codes mapping to 18,431 (61.9% of standard drug concepts) SNOMED CT codes and 9,897 (33.2%) RxNorm or RxNorm extension codes.

3.3. Mapping domain differences

Not all standardized concepts had the same domain as the source concept. Overall, there were 32 standardized concepts that were from a different domain than the source concept. For example, the source concept of ‘SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) immunization course started’ (OMOP concept id: 3548106) is in the Observation domain, but maps to the standardized concept of ‘SARS-COV-2 (COVID-19) vaccine, UNSPECIFIED’ (OMOP concept id: 724904) which is part of the Drug domain. This is not very common as seen by the fact that this is only true for 32 distinct concepts.

3.4. Use of previously created OHDSI tools

Table 5 shows part of the results from our previously created Scyros program for the converted CPRD AURUM data. The full results can be found in our project repository with comparative results in the created Scyros repository[25]. On average each patient is in the dataset for 22 years with a median of 17 years (as shown in Table 5).

Table 5.

Results of Scyros on CPRD AURUM.

Metric	Value
Patient Count	49,102,289
Time Span (in days)	275,940
Median Enrollment Age (in years)	17
Mean Enrollment Age (in years)	20.40
Median Exit Age (in years)	36
Mean Exit Age (in years)	40.76
Median Time Span per Patient (in days)	6,284
Mean Time Span per Patient (in days)	8,040.85
Median Visit Count per Patient	30
Mean Visit Count per Patient	79.08


Using the OHDSI Atlas cohort definitions[28], in the CPRD AURUM data, we found 510,673 individuals deceased before the age of 65 (Atlas Cohort #1777338), 51,979 people with Exudative Age-Related Macular Degeneration (Cohort #1777098), and 9,439,694 individuals with no visits within five years of their initial visit (Cohort #1777387). The ability to run Atlas cohort definitions on the OMOP converted CPRD AURUM data was not limited to these three cohort definitions.

4. Discussion

4.1. Manual interpretation

While much of the conversion process could be automated using previously developed mappings present in the Concept Relationship table in the OMOP vocabulary layer[20], there were several steps where human interpretation and manual review were necessary. One such instance is moving the death date column out of the CPRD Patient table and placing it in the OMOP Death table. Another example is interpreting the difference in meaning between the CPRD Drug Issue table and the OMOP Drug Exposure table: the CPRD version pertains to an issue that arises from a drug, whereas the OMOP table pertains to any drug exposure, and this difference should be understood when analysis is conducted. Along those lines, a manual mapping of the drug domain data rows in the CPRD Observation table was necessary to bind the rows to the OMOP Drug Exposure table without duplication. Manual review of disparate column names was also necessary to ensure proper alignment of columns. These manual steps can be easily understood and repeated with proper understanding of the OMOP CDM.

4.2. Use case and continuing work

For many researchers the conversion of data into a CDM is an initial step before performing an analysis[3]. This is the same for us. The conversion of the CPRD AURUM data was done as a preliminary step as part of multiple clinical use cases. One of these use cases is the CPRD approved study “Characterizing and analyzing COVID-19 diagnosed participants and their outcomes, such as mortality and disease progression based on contributing factors including comorbidities”. The study will analyze and characterize COVID-19 affected individuals in the CPRD AURUM dataset. This will be done by developing an analysis to be used on this OMOP converted data with results compared to other OMOP CDM datasets that have either already been converted or are in the process of being converted.

4.3. Dataset comparison

The use of the OMOP CDM allows for the efficient and effective comparison of different datasets even if the datasets have disparate terminologies and coding systems[7]. The ability to compare datasets allows for an understanding of differences based on context, type of data collected and geography. The assessment of such differences could help in determining which datasets are ideal for specific research questions and whether there are potential causes of the differences between datasets. By converting the data to the OMOP CDM we can compare patient counts between CPRD AURUM data and data from the UK Biobank program[9]. This includes diagnoses, where CPRD has diagnostic data in SNOMED CT and UK Biobank has diagnostic data in ICD-10 and ICD-9. For example, for Essential Hypertension (OMOP concept id: 320128) there are 3,644,542 patients in CPRD AURUM and 31,105 patients in UK Biobank. The same can be done for other disparate types of data including procedure (SNOMED CT for CPRD, OPCS for UK Biobank) and drug (DM+D for CPRD, Read for UK Biobank).

Similar comparison can be made using the results of the aforementioned tools. We previously ran the stated Scyros metrics on a set of datasets including clinical trials and EHR sources[29].

4.4. Excluded tables

Due to the nature of certain CPRD tables we decided to exclude them due to their lack of relevance and fit in the OMOP CDM. The two main excluded CPRD tables were the Referral and Consultation files. The data in these files were solely administrative and pertained to instances covered in the CPRD Observation table which was captured in our OMOP CDM conversion thus reducing any potential loss of data.

4.5. Limitations

With any data conversion there are limitations[30]. The biggest one is data loss. While the conversion minimized data loss, we acknowledge that some data did not fit into the OMOP CDM, especially administrative information. Our conversion is also limited in the mapping of individual source values to standard concepts, as they might not be exact word-for-word matches. However, this is mitigated by the fact that the source values are kept in the OMOP CDM and remain accessible if the source value is preferred.

5. Conclusions

The retrospective standardization and harmonization of existing datasets can vastly improve the reusability of the dataset. The CPRD AURUM data was able to be successfully converted into the OMOP CDM with minimal data loss and data quality concerns. In total 20,240,453,339 (98.0%) data rows were converted into the OMOP CDM and populated 12 OMOP tables. The converted data was able to be used in conjunction with preexisting OHDSI and OMOP CDM related tools and frameworks. The utility of the converted data and use of such tools allowed for the ability to characterize the CPRD AURUM data and compare to other datasets in the OMOP CDM. The existence of a standardized version of the CPRD AURUM data vastly increases its reusability in future research projects due to the increased understanding and tools available for data in the OMOP CDM. With the OMOP converted data we will be able to perform comparative multi dataset analyses, including on COVID-19 health outcomes across contexts and countries, via the use of equivalent analytic processes. Other similar type analyses are now feasible with a standardized version of the CPRD AURUM data, vastly decreasing the effort needed to semantically map CPRD AURUM data to disparate datasets.

Acknowledgement

This research was supported by the Lister Hill National Center for Biomedical Communications of the National Library of Medicine (NLM), National Institutes of Health. We would like to thank Nick Williams and James Mork for providing comments on drafts of this manuscript.

Funding:

This research was funded by the National Library of Medicine (NLM), National Institutes of Health.

Footnotes

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


