Abstract
Sudden Unexpected Death in Epilepsy (SUDEP) is the leading mode of epilepsy-related death. The Center for SUDEP Research (CSR) is an NINDS-funded Center Without Walls initiative aimed at prospectively creating a comprehensive clinical research resource for SUDEP. This resource consists of a growing set of data and biological samples from a statistically significant cohort of patients at elevated risk, best represented by the Epilepsy Monitoring Unit (EMU) patient population. The Informatics and Data Analytics Core (IDAC) of CSR has developed a state-of-the-art informatics infrastructure to integrate patient data captured in multiple EMUs at a greatly accelerated pace. Data quality assurance is a priority of IDAC. This paper reports our approach, Ontology-guided Data Curation for Multisite Clinical Research Data Integration (ODaCCI), to address the challenging task of centralized data curation while new data is continuously generated and integrated from distributed sites. ODaCCI leverages the Epilepsy and Seizure Ontology not only for upstream data capture, but also for supporting a range of quality assurance tasks such as data quality monitoring, data update, and data reports. Between October 2014 and February 2016, ODaCCI integrated phenotypic and electroencephalogram signal data of 629 patients from 7 clinical sites, while supporting continuous and asynchronous data quality enhancement over time.
1. Introduction
Data errors in research databases are prevalent. They include inaccurate interpretation of data in the initial documents and incorrect data entry into databases [1]. Data quality (DQ) in clinical research is a fundamental topic because it impacts subsequent decisions and conclusions [2, 3]. Researchers have shown that electronic health records (EHR) often contain errors that may affect research results [4]. However, the process of data quality auditing and improvement is challenging because of a multitude of issues, such as the lack of provenance information, the lack of coordinated versioning and updates, and the time and effort needed.
Data quality assurance is a unique, acute challenge for prospective, multi-site studies such as the Center for SUDEP Research (CSR [5]), a Center Without Walls initiative funded by the National Institute of Neurological Disorders and Stroke (NINDS). The CSR is a collaboration of 14 institutions, bringing together scientists and physicians to investigate and identify the molecular and structural brain abnormalities underlying SUDEP. The potential discoveries could then be used to identify features that predict which patients are at risk of SUDEP. Clinical phenotype, electroencephalogram (EEG), imaging, and other data are prospectively collected from CSR patients in multiple participating Epilepsy Monitoring Units (EMUs). As the EMUs capture multi-modal patient data, the data are de-identified and integrated into the CSR central data repository. Data correction and update at the source is less feasible than in the central data repository, sometimes because of a lack of resources and other times because of a lack of specific domain knowledge at an individual site. The Informatics and Data Analytics Core (IDAC) of CSR adopted a strategy of centralized data curation, reviewing and cross-referencing linked physiological data such as video-EEGs to enhance data quality.
This paper introduces an ontology-guided approach to support the IDAC data curation strategy, called ODaCCI, by integrating curation and reporting interfaces into the CSR data integration pipeline. A collection of 95 common data elements was identified by CSR domain experts as the main curation targets. ODaCCI consists of an epilepsy domain ontology-guided, web-based data curation interface to support data auditing, error correction, and data entry, with an interactive reporting interface for auditing the completeness of the types of data uploaded from multiple sites. A total of 629 patients’ data sets were integrated from 7 clinical sites from October 2014 to February 2016. Among them, 393 are from University Hospitals of Cleveland (UH), 65 from New York University (NYU), 11 from University of California, Los Angeles (UCLA), 54 from Northwestern University (NW), 44 from Thomas Jefferson University (TJU), 56 from University College London (UCL), and 6 from University of Iowa (UIowa). ODaCCI supports continuous and asynchronous quality improvement while data is ingested and integrated from multiple clinical sites.
2. Background
2.1. Center for SUDEP Research (CSR)
Epilepsy is the most common serious neurological disorder, affecting 65 million people worldwide [6]. Sudden Unexpected Death in Epilepsy (SUDEP) is one of the leading modes of epilepsy-related death. More than 1 out of 1,000 epilepsy patients die from SUDEP each year [7, 8]. However, the mechanism causing SUDEP is not well understood [5]. CSR is an ongoing NINDS-funded Center Without Walls initiative for Collaborative Research in the Epilepsies [9], which aims to accelerate the understanding of SUDEP by bringing together extensive expertise from 14 institutions across the United States and Europe. Due to the low annual incidence of SUDEP (~0.1%), cross-institution data sharing is required to collect SUDEP/near-SUDEP data of sufficient statistical significance. This is the mission of CSR, with the goal of recruiting at least 2,500 cases from EMUs in the participating clinical sites to support prospective recruitment and identify possible risks for SUDEP.
2.2. CSR Data Integration Challenges
Major challenges to CSR data integration include, but are not limited to, the following:
Data heterogeneity. Phenotypic data were captured in disparate formats (e.g., EHR, PDF documents) in different clinical sites. The integration of heterogeneous data is the most critical issue in a multisite research setting.
Data access restriction. This involves protecting the privacy of patient data. Researchers from one site should not have access to the protected health information (PHI) of patients from other sites (e.g., patient name, date of birth) without additional IRB review and agreement.
Multimodal data linkage. Phenotypic data and electrophysiological signal data recorded in different devices need to be properly linked.
Data quality. Quality data is essential for use in clinical research [10]. However, in a multisite research effort, inconsistent codings are often seen across different clinical sites. It is unavoidable that unintended erroneous input may happen during the manual data entry process. Missing data is another major issue.
2.3. Related Efforts to CSR Data Integration
To minimize the potential format heterogeneity, a uniform electronic data capturing system, called OPIC (Ontology-driven Patient Information Capture system) [13], has been adapted and enhanced to prospectively capture patient phenotypic data for CSR. To avoid inconsistent coding, an enhanced version of OPIC has incorporated standardized coding for terms in the Epilepsy and Seizure Ontology (EpSO) [12]. In addition, automatic data validity checks (e.g., data must conform to an expected format) have been implemented to ensure better data quality at the data entry stage. The enhanced OPIC has been deployed as a Virtual Machine (VM) image in the clinical sites. These OPIC instances operate behind hospital firewalls.
To address the data access challenge, a central CSR data repository has been built to integrate and store de-identified patient data from multiple sites. A data de-identification module has been implemented in OPIC to remove PHI before data are integrated into the central repository. To ensure patient-level record linkage without revealing PHI, a distributed study identifier generation algorithm has been implemented in OPIC using randomized n-gram hashing [11]. At the moment a patient record is initially created, a unique study identifier is automatically generated to facilitate the de-identification and linkage of multi-modal data.
2.4. CSR Data Quality Assurance
Although certain data quality assurance measures have been incorporated in the enhanced OPIC (e.g., data validity checks, standardized coding), inconsistencies may still exist due to unintended manual input or misinterpretation of observations from electrophysiological signals. Review, auditing, and curation of the acquired data in the central repository by CSR domain experts are necessary for ensuring data quality. Inconsistencies found by observing and analyzing electrophysiological signal data will also be used to help curate phenotypic data. Moreover, since patient data is prospectively collected and continuously growing, the central data repository needs to incorporate both incoming data from each individual site and data curated by CSR domain experts. Therefore, there is a unique need for a streamlined data integration and curation workflow to support the CSR data integration and quality assurance framework. This paper introduces ODaCCI to address this need.
3. Methods
3.1. Data Integration and Curation Workflow
Figure 1 shows the overall architecture of ODaCCI for integrating and curating patient phenotypic data from an individual site into the CSR central repository. Initially, the patient phenotype data is entered into the OPIC system with identifiable information at each individual site. Such phenotypic data includes patient demographics, history, medications, diagnosis, epileptogenic zone, seizure semiology, and EEG findings. Then, automatic de-identification by OPIC and manual de-identification by study personnel are performed to produce de-identified data, which is transferred to the CSR central repository through SSH File Transfer Protocol (SFTP) on a weekly basis. The de-identified data for a single clinical site are then imported into a standalone MySQL database in the central repository, which is interfaced with the web-based data curation system for a domain expert to perform auditing and curation. The curated data is saved in a separate MySQL database that contains only curated data. The rationale for separating the curated data from the original integrated data is to accommodate both the continuously acquired data from the source clinical site and the existing curated data in the central CSR repository. Moreover, when new data is integrated into the central de-identified database, the curated data in the separate database replaces the corresponding original values, which remain uncurated at the source.
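The override step at the end of this workflow can be sketched as follows. This is a minimal illustration in Python, not the ODaCCI implementation: the record layout (a dictionary keyed by study identifier and data element name), the sample values, and the function name `merge_curated` are assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical record key: (study_id, data_element_name).
RecordKey = Tuple[str, str]


def merge_curated(original: Dict[RecordKey, object],
                  curated: Dict[RecordKey, object]) -> Dict[RecordKey, object]:
    """Return the integrated view: curated values take precedence over
    freshly imported site values; everything else passes through."""
    merged = dict(original)   # start from the latest weekly import
    merged.update(curated)    # curated entries replace their originals
    return merged


# A new weekly import still carries the uncurated semiology code,
# but the expert's earlier correction survives the re-import.
weekly_import = {("S001", "semiology"): 12, ("S001", "age"): 34}
expert_edits = {("S001", "semiology"): 40}
integrated = merge_curated(weekly_import, expert_edits)
```

Keeping the curated values in their own store means a weekly re-import never has to know which source values were corrected; the precedence rule is applied once, at integration time.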
Figure 1.

CSR phenotypic data integration and curation workflow.
3.2. ODaCCI: Ontology-guided Web-based Data Curation System
In this subsection, we focus on introducing each of the following components in ODaCCI:
Common data elements (CDEs), which are data elements that are common to all the individual clinical sites.
Ontology-based vocabulary, which serves as the main knowledge source for possible values of domain-specific data elements, and plays an important role in the dynamic generation of web-based curation widgets and of MySQL statements for interacting with the backend database.
CDE to data source mappings, which link common data elements to actual tables and columns in the backend databases from different sources. Such mappings not only drive the rendering of the backend data in the web interface for curation, but also facilitate the translation of content captured in the web interface to MySQL statements to be executed in the backend database.
Dynamic generation of data curation widget, which provides a systematic way to generate web-based curation widgets for different types of data elements.
Dynamic generation of MySQL statements, which automatically translates the curation result obtained from the web-based widgets to backend MySQL statements.
Data auditing measures, which provide intuitive and interactive web-based reporting interfaces for domain experts to perform data auditing and curation.
3.2.1. Common Data Elements (CDEs)
A set of common data elements from all the data sources were selected by epilepsy domain experts for data integration. Selected epilepsy patient phenotypes include age, gender, epileptogenic zone, etiology, semiology, epileptiform discharge, drug, transection, body mass index, sleep position, smoking, and bleeding. Table 1 shows a partial list of common data elements organized in sections. For example, DEMOGRAPHIC section includes data elements Age and Gender; and CLASSIFICATION OF PAROXYSMAL EPISODES section consists of Epileptogenic zone, Etiology, Semiology, Nonepileptic semiology, and Lateralizing sign.
Table 1.
Sections and sample common data elements in each section. NA means not applicable.
| Section | CDE Name | Type | Properties |
|---|---|---|---|
| DEMOGRAPHIC | Age | numerical | NA |
| | Gender | categorical | NA |
| CLASSIFICATION OF PAROXYSMAL EPISODES | Epileptogenic zone | ontological | modifier |
| | Etiology | ontological | NA |
| | Semiology | ontological | modifier, only during admission |
| | Nonepileptic semiology | ontological | modifier |
| | Lateralizing sign | ontological | modifier, only during admission |
| EVALUATION | Epileptiform discharge | ontological | modifier |
| | Nonepileptiform abnormality | ontological | modifier, only during admission |
| | MRI/CT status | categorical | NA |
| | EEG type | categorical | NA |
| PAST AND CURRENT MEDICATIONS | Drug | ontological | time taken |
| EPILEPSY SURGERY FORM | Corpus Callosotomy | boolean | NA |
| | Transection | boolean | NA |
| | Deep brain stimulation | boolean | NA |
| CHECKLIST | Body mass index (BMI) | numerical | NA |
| | Sleep position | categorical | NA |
| | Cardiac disease | boolean | NA |
| FOLLOW-UP FORM | Still having seizures | boolean | NA |
| | Smoking | boolean | NA |
| | Drinking alcohol | boolean | NA |
| SUDEP FORM | Frothing around mouth | boolean | NA |
| | Fallen out of bed | boolean | NA |
| | Bleeding | boolean | NA |
The value set (i.e., set of possible values or responses) of a common data element determines the type of the common data element. Types of common data elements include boolean, categorical, numerical, and ontological. A data element is called ontological if its value set originates from a vocabulary of ontological terms organized as a hierarchy. For both categorical and ontological data elements, their possible values are coded using integers to support effective data capture and retrieval. For instance,
boolean: Transection is boolean with value set {0,1} (0 means no and 1 means yes);
categorical: Gender is categorical with value set {1,2} (1 means male and 2 means female);
numerical: Body mass index is numerical;
ontological: Semiology is ontological with value set {1,2,…, 51} (1 means Aura, 2 means Autonomic Seizure, and 51 means Hypnopompic Seizure). The hierarchical view of the value set can be found in Figure 2 (middle).
Figure 2.
Ontology-based vocabulary for Epileptogenic Zone, Semiology, and Drug.
A common data element may have properties such as modifier (see the last column in Table 1). For example, Semiology has two properties: modifier and only during admission. The Semiology modifier may be Generalized, Bilateral asymmetric, or Left or Right Axial/Proximal/Distal/Head/Face/Arm/Hand/Leg/Foot. The value of only during admission may be Yes or No. These common data elements and their specifications form a common data dictionary managed in a Comma-Separated Values (CSV) file.
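To make the idea of a CSV-managed data dictionary concrete, the sketch below loads a few of the Table 1 entries into typed records. The column names (`section`, `name`, `type`, `properties`), the semicolon separator for properties, and the helper `load_dictionary` are assumptions for illustration; the actual layout of the CSR dictionary file is not specified here.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Hypothetical CSV layout; rows are taken from Table 1.
SAMPLE_CSV = """section,name,type,properties
DEMOGRAPHIC,Age,numerical,
DEMOGRAPHIC,Gender,categorical,
CLASSIFICATION OF PAROXYSMAL EPISODES,Semiology,ontological,modifier;only during admission
EPILEPSY SURGERY FORM,Transection,boolean,
"""

VALID_TYPES = {"boolean", "categorical", "numerical", "ontological"}


@dataclass
class CommonDataElement:
    section: str
    name: str
    type: str
    properties: List[str]


def load_dictionary(text: str) -> List[CommonDataElement]:
    """Parse the dictionary and reject rows with an unknown element type."""
    elements = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["type"] not in VALID_TYPES:
            raise ValueError(f"unknown type for {row['name']}: {row['type']}")
        props = [p for p in row["properties"].split(";") if p]
        elements.append(
            CommonDataElement(row["section"], row["name"], row["type"], props)
        )
    return elements


cdes = load_dictionary(SAMPLE_CSV)
```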
3.2.2. Ontology-based Vocabulary
We leverage the Epilepsy and Seizure Ontology (EpSO) [12] as the vocabulary to construct value sets for ontological data elements. For each ontological data element, its value set consists of all the direct and indirect subtypes of a class in the EpSO. For instance, in EpSO, Epileptogenic Zone (left in Figure 2) has seven subtypes: Anterior Head Regions, Generalized, Hemisphere, Multi Focal, Posterior Head Regions, Unknown, and Unlocalizable. And Hemisphere further has subtypes including Central, Cingulate, Frontal, and Occipital. For such ontological terms in EpSO, the data capturing system OPIC [13] has dedicated integer codes. Therefore, we use the same standardized codes for data curation in order to seamlessly integrate continuously collected data and curated data.
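Collecting "all the direct and indirect subtypes" of a class amounts to a traversal of the subtype hierarchy. The sketch below shows one way to do this over a toy fragment containing only the terms named above; the dictionary representation and the function name `value_set` are assumptions, and the real EpSO hierarchy is considerably larger.

```python
from typing import Dict, List, Set

# Toy fragment: parent term -> direct subtypes (terms named in the text only).
SUBTYPES: Dict[str, List[str]] = {
    "Epileptogenic Zone": [
        "Anterior Head Regions", "Generalized", "Hemisphere", "Multi Focal",
        "Posterior Head Regions", "Unknown", "Unlocalizable",
    ],
    "Hemisphere": ["Central", "Cingulate", "Frontal", "Occipital"],
}


def value_set(root: str, subtypes: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect subtypes of `root` (root excluded)."""
    found: Set[str] = set()
    stack = list(subtypes.get(root, []))
    while stack:
        term = stack.pop()
        if term not in found:          # guard against revisiting a term
            found.add(term)
            stack.extend(subtypes.get(term, []))
    return found


zone_values = value_set("Epileptogenic Zone", SUBTYPES)
```

In this fragment the value set contains the seven direct subtypes plus the four listed subtypes of Hemisphere.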
3.2.3. CDE to Data Source Mappings
Since the data capturing system OPIC has been deployed in a distributed fashion and operates continuously at each clinical site, different sites may run different versions of OPIC, so the data schemas in the de-identified data sources may not always be the same. Therefore, for each site, the common data elements in the common data dictionary are mapped to source data tables and columns where applicable. For example, for each data source containing the patient phenotype Epileptogenic Zone, the common data element Epileptogenic Zone is mapped to the column “epileptogenic_zone_id” in the table “classification_zones,” and its property Modifier is mapped to the column “modifier” in the same table. Such mapping information is maintained in CSV files for each data source.
3.2.4. Dynamic Generation of Data Curation Widget
To facilitate manual curation of patient phenotypic data by domain experts, we developed a web-based data curation interface driven by the common data elements and their mappings to data sources. When a user triggers an event to edit or curate a common data element, an interactive dialogue (see Figure 3) is dynamically generated based on the type and properties of the data element as follows.
If the type is boolean, then a dropdown option of Yes and No is provided for selection.
If the type is categorical, then a dropdown list containing the value set of the data element is provided.
If the type is numerical, then a text box is displayed for editing.
If the type is ontological, then a multi-level dropdown list is rendered according to the ontology-based vocabulary corresponding to the data element. For instance, clicking the “Add Semiology” button inside the interactive dialogue (arrow (a) in Figure 3) first triggers the display of all the direct subtypes of Semiology; clicking one of these subtypes, “Aura,” further renders its direct subtypes (arrow (b) in Figure 3); clicking the subtype Autonomic Aura further drills down to its direct subtypes (arrow (c) in Figure 3). In addition, the Modifier property of Semiology can be updated by clicking the icon after its value (arrow (d) in Figure 3).
Figure 3.
Examples of dynamically generated interactive dialogues according to the types of common data elements.
Moreover, the default values for the data elements in the interactive dialogues are automatically populated from the data source. Therefore, the values that do not need curation are kept intact.
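The type-driven choice of widget described above can be sketched as a simple dispatch. The following Python is illustrative only: the production system is a Ruby on Rails application, and the function name `widget_spec` and the dictionary-based widget description are assumptions invented for this example.

```python
from typing import Any, Dict, List, Optional


def widget_spec(cde_type: str,
                current_value: Any,
                value_set: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Describe the widget for one data element (hypothetical spec format).
    The current source value is carried along as the default so that
    values needing no curation stay intact."""
    if cde_type == "boolean":
        return {"widget": "dropdown", "options": ["Yes", "No"],
                "default": current_value}
    if cde_type == "categorical":
        return {"widget": "dropdown", "options": list(value_set or []),
                "default": current_value}
    if cde_type == "numerical":
        return {"widget": "text_box", "default": current_value}
    if cde_type == "ontological":
        return {"widget": "multi_level_dropdown",
                "vocabulary": list(value_set or []),
                "default": current_value}
    raise ValueError(f"unsupported data element type: {cde_type}")


bmi_widget = widget_spec("numerical", 23.4)
gender_widget = widget_spec("categorical", 2, value_set=[1, 2])
```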
The web-based curation interface was implemented using an agile web development environment called Ruby on Rails (RoR) [18] with a MySQL backend database. The integrated and curated data sets are also stored in MySQL databases, which are separate from the main RoR application database. The specifications for the common data elements and ontology-based vocabularies in CSV files are imported into the application database to drive the web interface.
3.2.5. Dynamic Generation of MySQL Statement
The common data elements and their mappings to data sources not only drive the rendering of web-based curation interface, but also play an important role in the dynamic generation of MySQL statements for saving the interactive edits provided by domain experts. For each type of common data elements, a general template of MySQL statement is predefined and used for generating the actual MySQL statement for data curation. For example, the general template for an ontological common data element is predefined as:
INSERT INTO <mapping.table> (id, <mapping.column>, <property_1>, …, <property_n>) VALUES (<id_value>, <cde_value>, <property_1_value>, …, <property_n_value>);
where <mapping.table> and <mapping.column> represent the data source table and column to which the ontological common data element is mapped, and <property_i> represents its i-th property. All the variables in the angle brackets can be replaced by real values to generate the actual MySQL statement. For instance, in Figure 3, the first record for the ontological common data element Semiology has the following values for the variables in the template:
<mapping.table>: seizure_type_semiologies
<id_value>: 'TSXP606170783305'
<mapping.column>: semiology_id
<cde_value>: 40
<property_1>: modifier
<property_1_value>: 'Left Arm'
<property_2>: only_during_admission
<property_2_value>: NULL
Replacing the variables in the template with real values results in the following MySQL statement:
INSERT INTO seizure_type_semiologies (id, semiology_id, modifier, only_during_admission) VALUES ('TSXP606170783305', 40, 'Left Arm', NULL);
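The template expansion can be sketched in a few lines of Python. Note one deliberate difference from the description above: instead of substituting values directly into the statement text, this sketch emits `%s` placeholders and returns the values separately, which is the parameter style commonly accepted by Python MySQL drivers and avoids quoting problems. The function name `build_insert` and the identifier check are assumptions for illustration, not part of ODaCCI.

```python
import re
from typing import Any, Dict, List, Tuple

# Table and column names come from the mapping files; accept only plain
# identifiers since they are interpolated into the statement text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_insert(table: str, column: str, record_id: str, cde_value: Any,
                 properties: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Fill the INSERT template for one ontological data element."""
    columns = ["id", column, *properties.keys()]
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders});"
    params = [record_id, cde_value, *properties.values()]
    return sql, params


# The Semiology record from the example above; None stands in for SQL NULL.
sql, params = build_insert(
    "seizure_type_semiologies", "semiology_id", "TSXP606170783305", 40,
    {"modifier": "Left Arm", "only_during_admission": None},
)
```

The returned pair would typically be handed to a driver call of the form `cursor.execute(sql, params)`, which binds the values on the database side.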
3.2.6. Data Auditing Measures
Since CSR prospectively collects patient data from multiple sites, it is important to have measures for auditing the quality of the data being integrated. We adapt two commonly used data quality measures, completeness and consistency, to support the CSR data integration and curation process (refer to [20] for a review of data quality measures). We audit data completeness at two levels: by type of data (phenotype and EEG signal) and by common data element. For types of data, completeness is determined by whether both phenotypic and EEG signal data for a patient have been uploaded. For this, we developed an interactive reporting interface for the CSR data manager to audit the upload status of phenotypic and EEG signal data from individual clinical sites to the CSR central repository. For each common data element, completeness is measured as the number of patient visits with actual data for the common data element divided by the total number of patient visits. Note that one patient may have multiple patient visits.
Moreover, we monitor coding consistencies for categorical and ontological common data elements, which is calculated by the number of valid values for the common data element divided by the total number of its actual values (i.e., the percentage of the valid coding values among all the actual values). For numeric common data elements, we use representation consistency, which is obtained by the number of numeric values divided by the total number of its actual values (i.e., the percentage of the valid numeric values among all the actual values).
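The three ratios defined above can be written down directly. The sketch below is a minimal Python rendering under stated assumptions: each list holds one entry per patient visit, `None` marks a missing value, and the function names and sample values are invented for illustration.

```python
from typing import Iterable, Optional, Set


def completeness(values_by_visit: Iterable[object]) -> float:
    """Visits with actual data divided by total visits (None = missing).
    An empty input yields 0.0, mirroring the 0% shown for (0/0) in Table 2."""
    values = list(values_by_visit)
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def coding_consistency(values_by_visit: Iterable[object],
                       valid_codes: Set[int]) -> Optional[float]:
    """Valid codes divided by actual (non-missing) values; None if no values."""
    actual = [v for v in values_by_visit if v is not None]
    if not actual:
        return None
    return sum(v in valid_codes for v in actual) / len(actual)


def _is_numeric(value: object) -> bool:
    try:
        float(value)  # accepts ints, floats, and numeric strings
        return True
    except (TypeError, ValueError):
        return False


def representation_consistency(values_by_visit: Iterable[object]) -> Optional[float]:
    """Numeric values divided by actual (non-missing) values; None if no values."""
    actual = [v for v in values_by_visit if v is not None]
    if not actual:
        return None
    return sum(_is_numeric(v) for v in actual) / len(actual)


gender = [1, 2, None, 2]          # categorical, valid codes {1, 2}
bmi = ["23.4", "n/a", 31, None]   # numerical, one malformed entry
```

With these sample values, Gender is 75% complete with fully consistent coding, while two of the three recorded BMI values are validly numeric.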
4. Results
The CSR data integration pipeline and web-based data curation system have been implemented and deployed at https://medcis.case.edu/medcis. A collection of 95 common data elements were identified by CSR domain experts. A total of 629 patients have been recruited from seven clinical sites from October 2014 to February 2016. Among them, 393 are from UH, 65 from NYU, 11 from UCLA, 54 from NW, 44 from TJU, 56 from UCL, and 6 from UIowa. These numbers were generated in the reporting interface shown in Figure 5 (the numbers in the parentheses after the site names). The EEG signal data associated with these recruited patients exceeded 7TB. Since CSR is an ongoing effort, more patient data will be integrated and curated.
Figure 5.
A screenshot of the interactive reporting interface for auditing the type of data uploaded from all clinical sites.
4.1. Web-based Data Curation Interface
Figure 4 presents a screenshot of the web-based curation interface for the section CLASSIFICATION OF PAROXYSMAL EPISODES. Clicking the “Edit” button for a common data element such as Semiology will trigger an interactive dialogue for a curation widget (see the red arrow in Figure 4). Clicking the “Update” button when the curation is done will save the curation result to the backend database and reflect the updated result in the web interface. Such web-based data curation widgets not only provide domain experts or data curators an intuitive interface to review and correct data, but also ensure the quality of the corrected data by strictly conforming to the standardized codings used for data capturing in OPIC. The CSR domain experts have been actively using the data curation interface to correct unintended errors and inconsistencies found between patient phenotypic information and EEG signal data.
Figure 4.
A screenshot of the web-based curation interface for the section CLASSIFICATION OF PAROXYSMAL EPISODES.
4.2. Data Auditing Measures
Figure 5 shows the interactive reporting interface for auditing the completeness of the types of data uploaded from multiple sites. Domain experts can configure the report results by choosing sites of interest and the upload status of different types of data for each patient (EEG denotes the EEG signal data, PDF denotes the phenotypic data in PDF report). The returned results also include the number of visits for each patient (or Study ID). The customized reports can be exported to CSV files for the data manager in the CSR central repository to track down the types of missing data for each site and each patient.
Table 2 shows the data completeness of 10 common data elements by site and in total as of February 2016. Age, Gender, and Drug achieved over 90% completeness. For Age, a total of 97.76% data completeness was obtained across all sites. Individually, UH achieved 97.6% (487/499), NYU and UCLA achieved 100%, and NW achieved 96.3%, whereas TJU, UCL, and UIowa had 0% completeness for Age. The reason for the 0% completeness at these three sites is that the OPIC instances deployed there are not up to date, which is a common issue for multisite data integration or federated query (referred to as data release cycle synchronicity in [19]). The most up-to-date percentages will be available immediately after the OPIC instances are updated.
Table 2.
Data completeness in terms of 10 common data elements for multiple sites up to February 2016.
| CDE | UH | NYU | UCLA | NW | TJU | UCL | UIowa | Total |
|---|---|---|---|---|---|---|---|---|
| Age | 97.6% (487/499) | 100.0% (67/67) | 100.0% (6/6) | 96.3% (52/54) | 0% (0/0) | 0% (0/0) | 0% (0/0) | 97.76% (612/626) |
| Gender | 100.0% (499/499) | 100.0% (67/67) | 100.0% (6/6) | 3.7% (2/54) | 100.0% (40/40) | 100.0% (24/24) | 100.0% (5/5) | 92.52% (643/695) |
| Drug | 93.39% (466/499) | 100.0% (67/67) | 100.0% (6/6) | 81.48% (44/54) | 100.0% (40/40) | 50.0% (12/24) | 100.0% (5/5) | 92.09% (640/695) |
| Semiology | 78.76% (393/499) | 100.0% (67/67) | 16.67% (1/6) | 74.07% (40/54) | 92.5% (37/40) | 54.17% (13/24) | 80.0% (4/5) | 79.86% (555/695) |
| Etiology | 90.58% (452/499) | 73.13% (49/67) | 16.67% (1/6) | 37.04% (20/54) | 40.0% (16/40) | 45.83% (11/24) | 100.0% (5/5) | 79.71% (554/695) |
| EEG Type | 90.38% (451/499) | 16.42% (11/67) | 100.0% (6/6) | 88.89% (48/54) | 12.5% (5/40) | 12.5% (3/24) | 80.0% (4/5) | 75.97% (528/695) |
| Epileptogenic Zone | 73.95% (369/499) | 88.06% (59/67) | 16.67% (1/6) | 51.85% (28/54) | 47.5% (19/40) | 50.0% (12/24) | 60.0% (3/5) | 70.65% (491/695) |
| MRI/CT status | 65.73% (328/499) | 79.1% (53/67) | 66.67% (4/6) | 85.19% (46/54) | 92.5% (37/40) | 37.5% (9/24) | 100.0% (5/5) | 69.35% (482/695) |
| Ictal Seizure Type EEG | 62.73% (313/499) | 85.07% (57/67) | 100.0% (6/6) | 64.81% (35/54) | 0% (0/0) | 0% (0/0) | 0% (0/0) | 65.65% (411/626) |
| Epileptiform Discharge | 57.31% (286/499) | 77.61% (52/67) | 16.67% (1/6) | 57.41% (31/54) | 67.5% (27/40) | 8.33% (2/24) | 80.0% (4/5) | 57.99% (403/695) |
For data consistency, all categorical and ontological data elements achieved 100% coding consistency, which demonstrates the effectiveness of using standardized codings for data capture and integration. In addition, all numeric data elements achieved 100% representation consistency. These data quality measures have been incorporated into the curation system in real time to support data integration and curation.
5. Discussion
This paper presented ODaCCI, an ontology-guided data curation system for supporting multisite data integration in CSR. It is adaptable to other multisite clinical data integration and curation workflows because of its general design of defining common data elements and ontology-based vocabularies, and dynamic generation of web-based curation widget and dynamic interaction with backend databases.
For CSR, common data elements show different data completeness across sites (Table 2). Since patient recruitment and data collection are at an early stage, values for certain data elements may still be missing from the source sites. Real-time data auditing in the CSR central curation system enables the center's data curators to monitor the data entry process for each site, so that they can make data entry suggestions to each site for data elements with low completeness.
A limitation of ODaCCI is that it currently does not automatically handle the case that different sites may have data about the same patient. This is because OPIC instances are distributively deployed in multiple sites and cannot check potential duplicated patient information across different sites. Manual effort is needed to identify and resolve duplicated cases. Another limitation is that the web-based user interface has not been formally evaluated by end users. We plan to perform a usability evaluation of the web-based data curation interface and data auditing interface.
Related Work. Several related research efforts have been focused on multisite clinical research data integration and sharing. For instance, the Shared Health Research Information Network (SHRINE [21, 22]) was developed as a general clinical data integration system with federated query that could aggregate patient observations from multiple hospitals. It has been implemented for multisite studies of autism co-morbidity, colorectal cancer, diabetes, and others. SHRINE’s core ontologies include ICD-9-CM for diagnoses, RxNorm for medications, and LOINC for lab tests. The Human Studies Database Project (HSDB [23, 24]) has developed an informatics infrastructure for federated access and query of human studies databases. HSDB also developed the Ontology of Clinical Research (OCRe) to facilitate the data sharing. Similar to SHRINE and HSDB, ODaCCI uses a domain ontology to facilitate data integration.
There is also work on data quality assessment (DQA) of multisite clinical data. In [3], a “fit-for-use” conceptual model was proposed for DQA with a process model for conducting multisite DQA based on EHR data. In [25], DQA was performed based on EHR data from multiple student health centers in the College Health Surveillance Network. In [26], DQA was focused on Haiti’s national electronic medical record (EMR) systems, with an interactive “DQ dashboard” developed for quality improvement. Distinct from these studies [3, 21, 22, 23, 24, 25, 26], which focused on either multisite data integration or DQA, ODaCCI handles both multisite data integration and DQA. In addition, our web-based data curation system enables correction of data errors after integration, while new data continues to be integrated.
6. Conclusion
This paper presented ODaCCI, an ontology-guided data curation system to address the data quality assurance needs of ongoing multisite clinical research data integration in the Center for SUDEP Research. Phenotypic data and EEG signal data of 629 patients were acquired and integrated from seven clinical sites between October 2014 and February 2016. Different types of data auditing measures have been incorporated in real time to help the CSR domain experts with their curation work. ODaCCI supports continuous and asynchronous data quality improvement while data is ingested and integrated from multiple clinical sites.
Acknowledgement
This research was supported by the NINDS-funded Center Without Walls for Collaborative Research in the Epilepsies U01 awards (U01NS090408 and U01NS090405), as well as the University of Kentucky Center for Clinical and Translational Science (Clinical and Translational Science Award UL1TR000117).
References
- 1. Goldberg S, Niemierko A, Turchin A. Analysis of data errors in clinical research databases. In AMIA Annual Symposium Proceedings. 2008:242–6.
- 2. Krishnankutty B, Bellary S, Kumar BN, Moodahadu LS. Data management in clinical research: an overview. Indian journal of pharmacology. 2012 Mar 1;44(2):168. doi: 10.4103/0253-7613.93842.
- 3. Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Medical care. 2012 Jul;50. doi: 10.1097/MLR.0b013e318257dd67.
- 4. Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, Nordyke RJ. Review: Use of Electronic Medical Records for Health Outcomes Research: A Literature Review. Medical Care Research and Review. 2009 Dec;66(6):611–38. doi: 10.1177/1077558709332440.
- 5. Lhatoo S, Noebels J, Whittemore V. Sudden unexpected death in epilepsy: Identifying risk and preventing mortality. Epilepsia. 2015 Nov;56(11):1700–6. doi: 10.1111/epi.13134.
- 6. http://www.epilepsy.com/learn/epilepsy-101/what-epilepsy.
- 7. http://www.epilepsy.com/learn/impact/mortality/sudep.
- 8. Thurman DJ, Hesdorffer DC, French JA. Sudden unexpected death in epilepsy: assessing the public health burden. Epilepsia. 2014 Oct;55(10):1479–85. doi: 10.1111/epi.12666.
- 9. http://sudepresearch.org.
- 10. Ancker JS, Shih S, Singh MP, Snyder A, Edwards A, Kaushal R. HITEC investigators. Root causes underlying challenges to secondary use of data. In AMIA Annual Symposium Proceedings. 2011:57–62.
- 11. Zhang GQ, Tao S, Xing G, Mozes J, Zonjy B, Lhatoo SD, Cui L. NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research. JMIR medical informatics. 2015;3(4):e35. doi: 10.2196/medinform.4959.
- 12. Sahoo SS, Lhatoo SD, Gupta DK, Cui L, Zhao M, Jayapandian C, Bozorgi A, Zhang GQ. Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care. Journal of the American Medical Informatics Association. 2014;21(1):82–9. doi: 10.1136/amiajnl-2013-001696.
- 13. Sahoo SS, Zhao M, Luo L, Bozorgi A, Gupta D, Lhatoo SD, Zhang GQ. OPIC: ontology-driven patient information capturing system for epilepsy. In AMIA Annual Symposium Proceedings. 2012:799–808.
- 14. Cui L, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS. EpiDEA: Extracting structured epilepsy and seizure information from patient discharge summaries for cohort identification. AMIA Annual Symp Proc. 2012:1191–1200.
- 15. Cui L, Sahoo SS, Lhatoo SD, Garg G, Rai P, Bozorgi A, Zhang GQ. Complex epilepsy phenotype extraction from narrative clinical discharge summaries. Journal of Biomedical Informatics. 2014(51):272–279. doi: 10.1016/j.jbi.2014.06.006.
- 16. Mate S, Köpcke F, Toddenroth D, et al. Ontology-based data integration between clinical and research systems. PloS one. 2015;10(1):e0116656. doi: 10.1371/journal.pone.0116656.
- 17. Zhang GQ, Cui L, Lhatoo S, Schuele S, Sahoo S. MEDCIS: Multi-Modality Epilepsy Data Capture and Integration System. AMIA Annual Symp Proc. 2014:1248–1257.
- 18. http://rubyonrails.org/
- 19. Weber GM. Federated queries of clinical data repositories: Scaling to a national network. Journal of biomedical informatics. 2015;55:231–6. doi: 10.1016/j.jbi.2015.04.012.
- 20. Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. A Data Quality Ontology for the Secondary Use of EHR Data. In AMIA Annual Symposium Proceedings. 2015:1937–1946.
- 21. Weber GM, Murphy SN, McMurry AJ, MacFadden D, Nigrin DJ, Churchill S, Kohane IS. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. Journal of the American Medical Informatics Association. 2009 Oct;16(5):624–30. doi: 10.1197/jamia.M3191.
- 22. McMurry AJ, Murphy SN, MacFadden D, Weber G, Simons WW, Orechia J, Bickel J, Wattanasin N, Gilbert C, Trevvett P, Churchill S. SHRINE: enabling nationally scalable multi-site disease studies. PloS one. 2013 Mar;8(3):e55811. doi: 10.1371/journal.pone.0055811.
- 23. Sim I, Carini S, Tu S, Wynden R, Pollock BH, Mollah SA, Gabriel D, Hagler HK, Scheuermann RH, Lehmann HP, Wittkowski KM. The human studies database project: federating human studies design data using the ontology of clinical research. AMIA Summits Transl Sci Proc. 2010 Mar;201(0):51–5.
- 24. Sim I, Carini S, Tu SW, Detwiler LT, Brinkley J, Mollah SA, Burke K, Lehmann HP, Chakraborty S, Wittkowski KM, Pollock BH. Ontology-based federated data access to human studies information. AMIA Annual Symposium Proceedings. 2012:856–865.
- 25. Nobles AL, Vilankar K, Wu H, Barnes LE. Evaluation of data quality of multisite electronic health record data for secondary analysis. In Big Data (Big Data), 2015 IEEE International Conference on; 2015 Oct. pp. 2612–2620.
- 26. Puttkammer N, Baseman JG, Devine EB, Valles JS, Hyppolite N, Garilus F, Honor JG, Matheson AI, Zeliadt S, Yuhas K, Sherr K. An assessment of data quality in a multi-site electronic medical record system in Haiti. International journal of medical informatics. 2016 Feb;86:104–16. doi: 10.1016/j.ijmedinf.2015.11.003.




