Abstract
Acute lymphoblastic leukemia affects both children and adults. Rising costs of cancer care and patient burden contribute to the need to study factors influencing outcomes. This study explored the quality of datasets generated from a clinical research institution. Whether the data were ‘fit for use’ prior to examining survival/complications was determined through a systematic approach guided by the Weiskopf et al. 3x3 Data Quality Assessment Framework. The constructs of completeness, correctness, and currency were explored for the data dimensions of patient, variable, and time. Eleven types of data were retrieved. Sufficient data points were found for patient and variable data in each dataset (≥70% of cells filled with patient-level data). Although there was concordance between variables, we found the distribution of lab values and death data to be incorrect. There were missing values for labs ordered and death dates. Our study showed that datasets retrieved can vary, even from the same institution.
Introduction
Acute lymphoblastic leukemia (ALL) is a blood malignancy noted for its quick progression and poor prognosis upon relapse1. Despite the improved therapeutic options for ALL in recent decades, only 20-40% of adults achieve long-term remission2,3. Treatment of ALL is expensive. On average, treatment costs for a newly diagnosed patient can be approximately $1 million (from diagnosis to relapse and from relapse until death)4. Due to the high cost of care, high resource utilization in health systems, and the burden on patients and their families, there is a need to better understand factors that may influence outcomes.
Most studies focused on adult ALL survival rely on clinical trial data of only a few hundred patients, or registry data from public health reporting mandates. Registries are often limited in offering robust clinical details of treatment and information on other health conditions. Real-world data—data collected about a person’s health status or the delivery of care, which may include data generated from the electronic health record (EHR), insurance claims, and the patient themselves—can be obtained from multiple sources beyond typical clinical research settings5. Utilizing EHR data has the potential to improve research generalizability because study subjects may be more representative of real-world populations compared to those who enroll in prospective clinical trials6.
Suitability of data for reuse is defined as “the fitness of a clinical dataset for the intended purpose, specifically the extent to which the dataset can meet research needs for observational studies.”7 Many studies reusing observational clinical data have identified issues with data quality8-10, because EHR data quality is considered highly variable10-13, and its “fitness for use” is dependent on the context of each research study14,15. Poor EHR data quality negatively affects the validity and reproducibility of research results16. Data correctness, for example, can range anywhere from 44-100%12. Completeness and sensitivity of different types of data can also vary depending on what research question is being pursued11-13. Hence, there is a need for systematic methods for EHR data quality assessment in the context of reuse7-9,14,16,17. Data quality assessment explores the dimensionality18 of data available, as well as the data requirements to perform an intended study.
Datasets populated from multiple data sites can be assessed by a two-stage approach9 where: (1) a within-site and across-site data quality assessment is established using consensus standards through comparisons of cross-tabulations and descriptive statistics; (2) an analysis on datasets is conducted, usually focused on the independent and dependent variables directly related to the research question. Typically, researchers will retrieve a dataset and concentrate efforts on the data cleaning19 necessary to test a hypothesis of interest within an already-assembled dataset (stage 2)9. Amongst the few data quality frameworks available10,17,20, there currently is no consensus when it comes to defining what healthcare data quality is or how the process is characterized21. A ‘fit-for-use’ perspective applied to data quality assessments means understanding whether the data is appropriate to be used for its intended purpose to either attain knowledge, make decisions, or plan for the future9,15. The Weiskopf et al. ‘3x3 Data Quality Assessment (DQA)’ Framework16 is one approach that offers a set of guidelines to systematically and transparently assess the quality of the data. However, few have attempted to fully operationalize the processes recommended in the DQA in the real-world setting16.
Being able to perform a survival analysis from high-quality, real-world data can pave the way to a better understanding of ALL because much of the current research on survival is based on limited clinical trial and registry data. This paper explores stage 1 of the data quality assessment applied to the use case of examining the survival/complications of ALL patients, comparing two clinical repositories from one academic medical center. The objective of this study is to operationalize the DQA guidelines by demonstrating the processes in a real-world dataset. This study aims to determine the quality of data generated from a clinical research institution, as part of a clinical data research network (CDRN).
Methods
The University of California (UC) Davis Comprehensive Cancer Center is a National Cancer Institute-designated comprehensive cancer center serving the California Central Valley and inland Northern California. This research study obtained approval from the Institutional Review Board at the University of California Davis (IRB# 228708).
We requested EHR data from adult patients (age ≥18 years) diagnosed with ALL from a 17-year date period (January 1, 2000 to December 31, 2017), identified using ALL diagnosis codes (ICD-9: 204, 204.0, 204.00, 204.01, 204.02, 204.2, 204.21, 204.22, 204.8, 204.80, 204.81, 204.82, 204.9, 204.91, 204.92, 200.1, 200.3, 200.4, 200.5, 200.6, 200.7, 200.8, 202, 202.0, 202.4, 202.7, 202.8, 202.9; ICD-10: C91, C91.0, C91.00, C91.01, C91.02, C91.3, C91.30, C91.31, C91.32, C91.4, C91.40, C91.41, C91.42, C91.5, C91.50, C91.51, C91.52, C91.6, C91.60, C91.61, C91.62, C91.Z, C91.Z0, C91.Z1, C91.Z2, C91.9, C91.90, C91.91, C91.92, C83.5, C83.50, C83.51, C83.52, C83.53, C83.54, C83.55, C83.56, C83.57, C83.58, C83.59).
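As an illustration of this cohort-definition step, the sketch below filters a diagnosis table by the ALL code lists and the adult-age criterion. It is a minimal Python/pandas sketch, not the site's actual query logic: the DataFrame `dx` and columns `PERSON_ID`, `DX_CODE`, `DX_DATE`, and `BIRTH_DATE` are hypothetical names, and only a subset of the codes is shown.

```python
# A minimal sketch of the cohort filter (not the site's actual query logic).
# Assumes a pandas DataFrame `dx` with hypothetical columns PERSON_ID,
# DX_CODE, DX_DATE, and BIRTH_DATE (the date columns parsed as datetimes).
import pandas as pd

# Abbreviated code sets; the full ICD-9/ICD-10 lists appear in the text above.
ALL_CODES = {
    "204.0", "204.00", "204.01", "204.02",  # ICD-9 acute lymphoid leukemia (subset)
    "C91.0", "C91.00", "C91.01", "C91.02",  # ICD-10 ALL (subset)
}

def build_cohort(dx: pd.DataFrame) -> pd.Series:
    """Unique adults (>=18 years at diagnosis) with an ALL code in 2000-2017."""
    in_window = dx["DX_DATE"].between(pd.Timestamp("2000-01-01"),
                                      pd.Timestamp("2017-12-31"))
    has_code = dx["DX_CODE"].isin(ALL_CODES)
    age_years = (dx["DX_DATE"] - dx["BIRTH_DATE"]).dt.days / 365.25
    adult = age_years >= 18
    return dx.loc[in_window & has_code & adult, "PERSON_ID"].drop_duplicates()
```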
Patient data was accessed from the health institution’s two local clinical data repositories: (1) a repository developed as a result of participation in the federated Patient-centered SCAlable National Network for Effectiveness Research (pSCANNER)22, and (2) an institutional clinical data warehouse. The data which populated these clinical data repositories came from the Clarity database (Epic EHR). Since this study did not intend to compare the relative superiority or advantages of the two databases, the clinical data repositories are referenced in this paper as ‘Data Repository 1 (DR1)’ and ‘Data Repository 2 (DR2)’ to mask their identities. Additionally, data sites within the clinical data research network (CDRN) were masked in this manuscript, named ‘Site 1’ through ‘Site 10’, to protect the identity of each site.
pSCANNER is a national, stakeholder-governed federated network whose patient population is highly diverse with respect to insurance coverage, socioeconomic status, demographics, and health conditions, making it ideal for observational studies22. It utilizes a distributed, service-oriented architecture to allow access to clinical data in a standard form (i.e., the Observational Medical Outcomes Partnership (OMOP) common data model). The institutional clinical data warehouse, on the other hand, is a tethered meta-registry derived from the University of California Research eXchange (UC-ReX) using the Informatics for Integrating Biology and the Bedside (i2b2)23 data model. It is an EHR data warehouse that holds more than five years of patient data.
The data request included any adult patient with a diagnosis of ALL. For each patient record, we requested the date of ALL diagnosis, date of ALL relapse, comorbidities at the time of ALL diagnosis, any previous cancer diagnosis, death date, treatments, discharge dispositions, and laboratory values associated with ALL. To determine whether the cohort discovery process returned a sufficient sample for study feasibility, the number of patient records retrieved was compared with the results from a similar ALL study conducted by Master et al.24.
Data Analysis
This study piloted the feasibility of accomplishing the task of conducting a survival/complications analysis using real-world data, with the goal to identify areas for improvement to inform future research. Therefore, analysis was conducted on one site out of the ten sites within the CDRN. Version 1.0 of the Weiskopf et al. ‘3x3 Data Quality Assessment (DQA)’16 was utilized as a guideline to assess the quality of the data and determine fit-for-use.
1. Identifying scope
We answered the 14 questions of the 3x3 DQA guideline. Level 1 (Questions 1-3) specified whether more than one patient, variable, or time point was involved in the study. Level 2 (Questions 4-11) focused on the variables of the study and the time points associated with the research. Level 3 (Questions 12-14) explored the data and its quality needs, such as the time points and order in which the variables must be recorded to conduct the research.
2. Operationalizing the three fundamental constructs of data quality across the three primary dimensions of data
Although the 3x3 DQA provided guidance on how to assess for completeness, correctness and currency for the patient, variable, and time data, the final operationalization of the assessment for ‘fit-for-purpose’ was left to the discretion of the informaticians.
Completeness of data gives insight into whether a truth about a patient exists14,16. The assessment of completeness means there is sufficient data for each patient, each variable, and each time point. We assessed completeness by first determining an overall level of data missingness: we calculated the total number of possible data points by multiplying the variables, data characteristics, and patients. We then calculated the percentage of data points actually available for every variable element. Since rates of completeness reported in the literature and the DQA are subjective, contextual, and not generalizable14, we defined the minimum percentage deemed sufficient for continued analysis as 70%.
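This completeness calculation can be expressed compactly. The sketch below is a minimal Python/pandas illustration, assuming each retrieved file has been loaded as a DataFrame; the 70% floor is the threshold defined above.

```python
import pandas as pd

SUFFICIENCY_MINIMUM = 0.70  # minimum fraction of filled cells defined in this study

def completeness(df: pd.DataFrame) -> float:
    """Overall fill rate: non-missing cells / (rows x columns)."""
    total_cells = df.shape[0] * df.shape[1]
    return df.notna().to_numpy().sum() / total_cells

def variable_completeness(df: pd.DataFrame) -> pd.Series:
    """Per-variable fill rate, used to locate the columns driving missingness."""
    return df.notna().mean().sort_values()

# A file falling below the floor triggers a variable-level look, e.g.:
# if completeness(complications) < SUFFICIENCY_MINIMUM:
#     print(variable_completeness(complications))
```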
Correctness of data concerns whether a value is true, whether there is agreement between data elements, and whether the value is plausible based on external knowledge16. We explored the construct of correctness for the data dimensions of patient, variable, and time by determining the plausibility of values. At the patient level, this meant determining whether the data fell within a plausible range for the patient population. We reviewed laboratory values of patients and determined whether the lab values were acceptable based on a reliable external source. For example, we consulted with a hematologist not only to define plausible ranges, but to ensure individual outlier lab values were plausible despite being out of range. Since we wanted to conduct a survival analysis, we compared the percentage of deaths found in the dataset to a national registry, such as Surveillance, Epidemiology, and End Results (SEER).
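A range-based plausibility screen of this kind could be implemented as in the sketch below. The reference ranges shown are illustrative placeholders, not clinically authoritative values, and the column names `LAB_NAME` and `VALUE` are assumptions; as described above, flagged values would still go to clinician review, since abnormal results can be plausible in this population.

```python
import pandas as pd

# Illustrative (not clinically authoritative) reference ranges: lab -> (low, high).
REFERENCE_RANGES = {
    "Platelet count": (150, 400),  # 10^3/uL, illustrative only
    "Chloride": (98, 107),         # mmol/L, illustrative only
}

def flag_out_of_range(labs: pd.DataFrame) -> pd.DataFrame:
    """Keep labs with a reference range on file and add an OUT_OF_RANGE flag.
    Assumes hypothetical columns LAB_NAME and VALUE."""
    known = labs[labs["LAB_NAME"].isin(REFERENCE_RANGES)].copy()
    bounds = known["LAB_NAME"].map(REFERENCE_RANGES)
    known["OUT_OF_RANGE"] = [
        not (low <= value <= high)
        for value, (low, high) in zip(known["VALUE"], bounds)
    ]
    return known

# Percent out of range per lab (cf. Table 2), feeding the clinician review:
# flag_out_of_range(labs).groupby("LAB_NAME")["OUT_OF_RANGE"].mean()
```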
Because we expected to retrieve different types of data (e.g., patient, labs, visits, treatment), it was important to see if there was agreement between the values as a proxy of data correctness. To determine concordance between variables, we identified variables where the value of one implies the value of another. For example, there should not be data present for patients after their death date because the patient is deceased. We determined concordance between variables by reviewing the dates of the retrieved files. We calculated the percentage of records that had data beyond the death date and set a limit of 99% correct in order to continue analysis, because the purpose of using our dataset was to conduct a survival analysis: more than 1% of data beyond the death date may indicate incorrect death date values. Similar approaches were used for other variables, such as the presence of a “platelet count” and “ALL diagnosis,” because treated ALL patients should have at least one lab value specific to their disease. We also calculated the average number of days of data in the system (i.e., earliest ALL diagnosis date to most recent encounter date) to determine the percentage of patients with usable data. This helped us infer whether the person was an actual patient receiving regular care in the system.
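The concordance check against the death date might look like the following sketch, assuming hypothetical `events` and `deaths` DataFrames keyed by `PERSON_ID`; the 99% cutoff is the limit defined above.

```python
import pandas as pd

CONCORDANCE_MINIMUM = 0.99  # study cutoff: <=1% of records beyond the death date

def records_after_death(events: pd.DataFrame, deaths: pd.DataFrame) -> pd.DataFrame:
    """Return event rows dated after the patient's recorded death date.
    Assumes hypothetical columns PERSON_ID, EVENT_DATE, and DEATH_DATE."""
    merged = events.merge(deaths[["PERSON_ID", "DEATH_DATE"]],
                          on="PERSON_ID", how="inner")
    return merged[merged["EVENT_DATE"] > merged["DEATH_DATE"]]

def is_concordant(events: pd.DataFrame, deaths: pd.DataFrame) -> bool:
    """True if at least 99% of records fall on or before the death date."""
    n_bad = len(records_after_death(events, deaths))
    return 1 - n_bad / len(events) >= CONCORDANCE_MINIMUM
```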
Currency of data means the data value is representative of the clinically relevant time16. We assessed whether the variables were recorded in the desired order. We intended to observe the presence of comorbidities when conducting the survival analysis for this patient population. We defined a comorbidity as any Elixhauser25-categorized diagnosis that occurred on or before the earliest date of ALL diagnosis. To assess the currency of the data, we calculated the percentages of ‘no data prior to ALL diagnosis’, ‘presence of comorbidity’, and ‘absence of comorbidity’, as sketched below. These values were compared with the comorbidity incidence reported for adult ALL in another database26.
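A minimal sketch of this three-way classification, using the condition variable names reported in the Results and a hypothetical set of Elixhauser-mapped concept identifiers:

```python
import pandas as pd

def classify_currency(conditions: pd.DataFrame,
                      first_all_dx: pd.Series,
                      elixhauser_ids: set) -> pd.Series:
    """Bin each patient by comorbidity status relative to first ALL diagnosis.
    conditions: columns PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_START_DATETIME.
    first_all_dx: earliest ALL diagnosis date, indexed by PERSON_ID.
    elixhauser_ids: hypothetical set of Elixhauser-mapped concept IDs."""
    def classify(pid):
        rows = conditions[conditions["PERSON_ID"] == pid]
        prior = rows[rows["CONDITION_START_DATETIME"] <= first_all_dx[pid]]
        if prior.empty:
            return "no data prior to ALL diagnosis"
        if prior["CONDITION_CONCEPT_ID"].isin(elixhauser_ids).any():
            return "comorbidity present"
        return "comorbidity absent"
    return pd.Series({pid: classify(pid) for pid in first_all_dx.index})

# Percentages to compare across repositories and with external reports:
# classify_currency(conditions, first_all_dx, elixhauser_ids).value_counts(normalize=True)
```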
Results
There were 13,150 records identified in the preliminary search of patients in the CDRN from ten data sites. This cohort discovery returned enough of a sample to make this study feasible. This determination was made by comparison with a recent, similar study on ALL24, whose dataset consisted of 17,504 total patients who either received chemotherapy or chemotherapy and stem cell transplantation. The adequate number of available records allowed the completeness of the data to be explored. We acquired a total of 11 file types for analysis from each of the data sites: de-identified list of patients with ALL, demographics, other cancer diagnoses, comorbidities, complications, death, labs, medications, relapse, treatment, and visit data.
Scope Determination of Data Quality Assessment
We defined the scope of our data quality assessment by answering the 14 questions in the DQA. To achieve the intended purpose of conducting a time-to-event analysis, we determined our study involved: (1) more than one patient, (2) more than one variable, and (3) more than one point in time for each patient. Our proposed survival analysis did not require the data to be recorded with a certain regularity over time because we presumed death happened only once. Based on our answers, the DQA recommended that we assess: (1) the completeness of patient, variable, and time data; (2) the correctness of patient, variable, and time data; (3) the currency of patient and variable data (Figure 1).
Figure 1.
Recommendations of what to assess for data quality based on Weiskopf et al. 3x3 DQA Framework.
* = Although the 3x3 DQA recommended the assessment of completeness for time data, correctness of time data, and currency of patient data, these items were not assessed because temporal data was not available for analysis.
We did not assess the currency of time data because the data we requested was not recorded with a set regularity. However, the DQA did recommend the assessments for completeness of time data, correctness of time data, and currency of patient data (Figure 1). We did not conduct these assessments due to the limitations of our dataset, as described below.
Assessing completeness of time data (Figure 1, Cell 3A) meant determining the sufficiency of data points at each measurement time (e.g., before and after an intervention). We could not evaluate this for our study. For example, labs ordered at the time of diagnosis could not be determined because we could not get a confirmed diagnosis date in the EHR. In this system, the first date of ALL diagnosis meant the first date the ALL diagnosis appeared in the facility’s EHR. The initial diagnosis may or may not have been made at that particular institution.
Assessing correctness of time data (Figure 1, Cell 3B) meant determining whether the progression of data over time was plausible. For example, if height and weight were recorded for a patient, then one would assume that a growing person (up to age 21) would not decrease in height significantly. We did not explore this construct because we did not request any sequential variables that were recorded with regularity over time (e.g., vitals, height, blood pressure).
Assessing the currency of patient data (Figure 1, Cell 1C) meant determining whether data were recorded during the timeframe of interest. The DQA suggested reviewing timestamp log information associated with each data point to ensure that the data were recorded within the desired range. We did not attempt this assessment because we did not request timestamp log information for when data were recorded. Therefore, the construct of currency of patient data was not assessed.
Assessment for Completeness of Data
1A: Is there sufficient data for each patient?
We discovered that the term ‘patient records’ did not necessarily indicate individual patients. A higher number of patient records were reported in the cohort discovery process compared to actual unique patients found in the repositories due to multiple encounters recorded for each patient in the EHR. For example, this clinical site estimated attaining 500 records out of the 13,150 found in cohort discovery. Data retrieved from two repositories in one site using the same criteria resulted in slight differences in all eleven data types requested. DR1 (n=41 unique patients) produced data on three types—death, meds, and relapse—that were not present in DR2 (n=42 unique patients). To conduct a survival analysis, death data must be present. There was information on death for DR1, but no death recorded in DR2.
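This records-versus-patients distinction can be made explicit when profiling a retrieved file; a minimal sketch, assuming a hypothetical `PERSON_ID` column:

```python
import pandas as pd

def summarize_cohort(df: pd.DataFrame) -> dict:
    """Report both counts for a retrieved file; assumes a PERSON_ID column."""
    return {
        "records": len(df),                           # one row per encounter/event
        "unique_patients": df["PERSON_ID"].nunique(),
    }
```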
The two repositories yielded patient-level data with an overall average of more than 70% of cells filled across the 11 file types. However, DR1 produced 8.5 times more data than DR2 (DR1: 10,628,552 cells filled with data; DR2: 1,239,187 cells), with a higher overall mean ratio of cells filled with data (DR1: 78.4% of cells with data; DR2: 71.8%). Based on the volume of data produced and the availability of death data for survival analysis, DR1 was sufficient to continue with the quality assessment. Although DR1 had more data, the ‘complications’ file type of DR1 was only 68.2% complete compared to 98.2% in DR2. Because our time-to-event analysis requires a sufficient amount of ‘complications’ data (i.e., data relating to a diagnosis of an adverse effect from ALL treatment or care), we further assessed the values present at the variable level.
There were a total of 71 variables identified in the retrieved datasets. The first part of our completeness analysis showed that, despite DR1 possessing a sufficient amount of patient data cumulatively, two file types, ‘complications’ and ‘meds’, were below the established sufficiency minimum of 70%. Because one of the intended purposes of retrieving this data was to conduct a time-to-complications analysis, we needed to assess sufficiency at the variable level. Although the ‘complications’ and ‘meds’ files revealed missing data, it was important to identify which values among these variables were actually missing (Table 1). Table 1 gives insight into how much data there is for each variable, as well as how many of the identified patients in each file type had data for that particular variable.
Table 1.
Presence of variable data in data repository 1 for ‘complications’ and ‘meds’ files
| % cells w/ data present for this file type | Data file type | Variable | % of records with this variable present | % of patients with this variable present |
|---|---|---|---|---|
| 68.2% | Complications | CONDITION_END_DATE | 2% | 19% |
| | | CONDITION_SOURCE_VALUE | 100% | 100% |
| | | CONDITION_START_DATE | 100% | 100% |
| | | PERSON_ID | 100% | 100% |
| | | VISIT_OCCURRENCE_ID | 39% | 100% |
| 66.5% | Meds | DAYS_SUPPLY | 0% | 0% |
| | | DRUG_CONCEPT_ID | 88% | 100% |
| | | DRUG_EXPOSURE_END_DATE | 3% | 69% |
| | | DRUG_EXPOSURE_ID | 100% | 100% |
| | | DRUG_EXPOSURE_START_DATE | 100% | 100% |
| | | DRUG_SOURCE_CONCEPT_ID | 88% | 100% |
| | | DRUG_SOURCE_VALUE | 100% | 100% |
| | | DRUG_TYPE_CONCEPT_ID | 100% | 100% |
| | | LOT_NUMBER | 0% | 0% |
| | | PERSON_ID | 100% | 100% |
| | | QUANTITY | 100% | 100% |
| | | REFILLS | 3% | 69% |
| | | ROUTE_CONCEPT_ID | 90% | 96% |
| | | ROUTE_SOURCE_VALUE | 99% | 100% |
| | | SIG | 3% | 69% |
| | | STOP_REASON | 24% | 62% |
| | | VISIT_OCCURRENCE_ID | 100% | 100% |
For the ‘complications’ file, we were interested in the variable data to characterize a unique person (i.e., ‘PERSON_ID’), the diagnoses of their complications (i.e., ‘CONDITION_SOURCE_VALUE’), and the start dates associated with each identified complication (i.e., ‘CONDITION_START_DATE’). In our findings, we saw that 26 out of the 41 patients had ‘complications’ data. Table 1 allowed us to discern what constitutes the ‘complications’ dataset: how much of the dataset had values for a particular variable and what percentage of the identified patients possessed data for that variable. Examining the data at the variable level showed that the missingness of values was attributed to the limited amount of data present for the variables ‘CONDITION_END_DATE’ and ‘VISIT_OCCURRENCE_ID’, two variables unnecessary for our particular study. The percentages of the key data elements available to conduct analysis were 100% (Table 1).
For the ‘meds’ file, we were interested in the variable data to characterize a unique person (i.e. ‘PERSON_ID’), the type of medication (i.e. ‘DRUG_TYPE_CONCEPT_ID’), and the start date of the medication (i.e. ‘DRUG_EXPOSURE_START_DATE’). We found that 100% of the records in the ‘meds’ file had a value for those variables, and 100% of the 26 patients with medication information had data for the variables we wanted. Our analysis showed that there was sufficient data for each relevant variable to continue assessing the quality of DR1.
Assessment for Correctness of Data
1B: Is the distribution of values across patients plausible?
The distribution of 6,727 patient lab tests was assessed for plausibility in DR1. Results showed that the majority of the lab tests (i.e., 18 out of 21 lab tests) possessed values that were out of range to varying degrees (Table 2). For example, chloride values deviated from the normal range 15% of the time, while platelet counts were out of range 63% of the time. An interdisciplinary team with qualifications in hematology/oncology, epidemiology, and informatics reviewed these lab test value ranges together. Clinical expertise and evaluation from the team’s physician, as a reliable external source, verified that although most of the results were considered abnormal lab values, these values were plausible for someone diagnosed with this condition.
Table 2.
Review of plausibility of lab test data retrieved from data repository 1
| Measurement description | # of records in file with any data | # of patients with any data in file | # of lab values in range | # of lab values out of range | % lab values out of range | Clinician review for plausibility |
|---|---|---|---|---|---|---|
| Creatine kinase [Enzymatic activity/volume] in Serum or Plasma | 3 | 3 | 3 | 0 | 0% | # of records should match since these labs are ordered simultaneously |
| Creatine kinase.MB [Mass/volume] in Serum or Plasma | 1 | 1 | 1 | 0 | 0% | |
| Troponin I.cardiac [Mass/volume] in Serum or Plasma | 3 | 3 | 3 | 0 | 0% | |
| Hemoglobin A1c (Glycated) | 1 | 1 | 0 | 1 | 100% | Considered plausible lab value despite being out of range |
| Cholesterol in LDL [Mass/volume] in Serum or Plasma by Direct assay | 4 | 3 | 3 | 1 | 25% | |
| aPTT in Platelet poor plasma by Coagulation assay | 17 | 6 | 14 | 3 | 18% | # of records should match since these labs are ordered simultaneously |
| PT panel - Platelet poor plasma by Coagulation assay | 101 | 11 | 82 | 19 | 19% | |
| Leukocytes [#/volume] in Blood by Automated count | 503 | 18 | 160 | 343 | 68% | The number of patients with data should be higher because these labs are ordered for all patients with ALL (e.g., all 41 patients in DR1 should have these lab tests present if they have ALL) |
| Platelet count | 506 | 18 | 188 | 318 | 63% | |
| Potassium [Moles/volume] in Blood | 513 | 18 | 421 | 92 | 18% | |
| Sodium [Moles/volume] in Blood | 509 | 18 | 406 | 103 | 20% | |
| Urea nitrogen [Mass/volume] in Blood | 509 | 18 | 242 | 267 | 52% | |
| Calcium [Mass/volume] in Blood | 509 | 18 | 327 | 182 | 36% | |
| Carbon dioxide, serum/plasma | 509 | 18 | 285 | 224 | 44% | |
| Chloride [Moles/volume] in Urine | 509 | 18 | 432 | 77 | 15% | |
| Creatinine [Mass/volume] in Blood | 509 | 18 | 389 | 120 | 24% | |
| Creatinine, serum/plasma | 3 | 1 | 0 | 3 | 100% | |
| Erythrocytes [#/volume] in Blood by Automated count | 503 | 18 | 36 | 467 | 93% | |
| Glucose [Mass/volume] in Blood | 509 | 18 | 150 | 359 | 71% | |
| Hematocrit [Volume Fraction] of Blood by Automated count | 503 | 18 | 47 | 456 | 91% | |
| Hemoglobin | 503 | 18 | 38 | 465 | 92% | |
Despite a large number of records for some lab tests ordered (e.g., leukocytes, 503 records; platelet count, 506 records; potassium, 513 records), only 18 of 41 patients had lab data (Table 2). A clinical expert confirmed these labs were typically ordered for all patients with ALL; in this assessment, the expected number of patients with records should be 41 in DR1. It was not plausible that 56% of the patients were missing lab values. In a contrary example, we would not expect all ALL patients to have labs such as HbA1c or cholesterol, as these would be ordered for a patient who also has diabetes, and not all patients with ALL have diabetes. Another way we assessed plausibility for this table was to group certain tests together because they are usually ordered simultaneously; the expected numbers of patients with lab data should match. Table 2 showed creatine kinases and cardiac troponin were ordered together, so the number of records and patients with this data should match, but they did not.
The number of deaths prompted the team to investigate how much data was actually retrieved. Despite the criteria in the data request including 17 years, DR1 produced data spanning 9 years (2009-2018) and DR2 produced data spanning 6 years (2012-2018). In assessing the plausibility of the population and death data, we found 6 recorded death dates among 41 patients in DR1. The mean follow-up time was 273.17 days among those deceased. The deaths in the dataset accounted for 15% of the cohort, compared to the American Cancer Society’s estimate of annual ALL-related deaths of 2% (N=97,765)27. We found two issues related to the population and death data that affected the plausibility of the data elements. First, upon validating the number of patients in our dataset (N=41), we found it to be substantially smaller than the physician’s own list of ALL patients at the same data site (N=89). Second, the physician list confirmed more than 6 deaths in the last 2 years alone, which did not match our dataset’s finding of only 6 deaths total. Assessing the values associated with labs and death showed that the distribution of data across patients was not plausible.
2B: Is there concordance between variables?
Dates within the files were scrutinized to determine concordance between the variables in DR1. We determined the correctness of the death date by assessing whether the other data were in agreement with it. Among the 11 data file types, we found one record (0.02% of the dataset) with medication data dated 2 days after the date of death. The research team reviewed this data and agreed as a group that the death date was most likely correct despite the medication data after the death date. One justification for keeping this death date for analysis was the possibility that the medications were ordered before the patient died. Another came from review of the administrative billing (i.e., claims) data, which indicated that the medication may have been dated 2 days post death rather than prescribed at that time. There could have been a disconnect where a family member or patient requested a refill and the physician filled the request a few days later, after the patient was deceased. Despite concordance between variables, reviewing the plausibility of the distribution of values across the patient data revealed that our dataset was not correct.
Assessment for Currency of Data
2C: Were the variables recorded in the desired order?
Data were assessed to determine if the elements were recorded in the desired order. Data repository 1 produced ‘comorbidity’ data with 17 associated variables. We focused on the variables ‘PERSON_ID’, ‘CONDITION_CONCEPT_ID’, and ‘CONDITION_START_DATETIME’. Applying our definition of the presence of a comorbidity (Methods) gave the following results: 88% of patients in DR1 had a comorbidity present compared to 50% in DR2. Approximately 7% of the patients in DR2 did not have a comorbidity diagnosis present, compared with 5% in DR1. The values were also very different between the data repositories: DR2 had more than four times the amount of missing comorbidity data. When compared with the comorbidity incidence reported for adult ALL in an external study using another database26, comorbidity presence was 21% and absence was 79% (based on the Charlson Comorbidity Index rather than Elixhauser). We deemed the variables for analysis to be recorded in the desired order, but the inconsistency of the comorbidity data also pushed the team to question how much data was actually retrieved; DR1 produced 3 more years’ worth of data than DR2.
Discussion
Understanding the characteristics and usefulness of an EHR-derived dataset prior to conducting a survival analysis for adult ALL is important because erroneous data, even with accurate analysis, will produce erroneous results. Our study shows that datasets retrieved can vary, even from the same institution. In this data quality assessment, documenting the steps of retrieval, assessment, and profiling allowed us to confirm whether or not the data were of high quality.
Data profiling techniques were utilized to assess the quality of data generated from two clinical data repositories for determining survival and complications of patients with ALL at a single clinical data site. Healthcare decisions are made every day based on the information available. Without understanding the quality of the data, we may be missing critical factors that affect the trajectory of treatment and care.
Summary of Findings
The first data repository (DR1) had 41 patients (years 2009-2018), and the second data repository (DR2) had 42 patients (years 2012-2018). There were 11 types of data retrieved: unique patients diagnosed with ALL, other cancer diagnoses, treatment, medication, comorbidity, labs, visits, complications, demographic, relapse, and death data. We found sufficient data points for each patient and variable in each dataset (≥70% of cells filled with patient-level data). However, DR1 produced data on 3 types—death, medication, relapse—that were not present in DR2. DR1 produced 8.5 times more data points with a higher mean ratio of cells filled with data (DR1: 78.4%; DR2: 71.8%). Variables were recorded in the desired order for comorbidity data: DR1 showed 88% of patients with a comorbidity, while DR2 had 50%. Although there was concordance between variables, we found the distribution of lab values and death data to be incorrect. There were missing values for labs ordered and death dates.
Our study showed that data retrieved from the two data repositories were considered complete and current, but not correct. The absence of data may mean: (1) there is no data for that patient, or (2) there is data present for that patient, but it was not retrieved or captured as structured data. In a data quality assessment, an informatician may need to decide whether to keep or exclude certain data for analysis. This step can change results, and without documentation, others cannot confirm whether the analysis is of high quality.
To conduct a survival analysis, we needed to confirm the date a person was diagnosed with ALL, the death date, and the last encounter date. In the EHR, it was difficult to confirm the date of diagnosis, since the earliest date of ALL diagnosis was the first time the diagnosis appeared in the record; the person may or may not have been diagnosed prior to entering this particular institution. It is important to note that the institutional EHR, and thus the secondary use of clinical data warehouses, may or may not reflect the patient’s entire medical record, as patients enter and leave the clinical institution for different reasons. These reasons may include changes in health plan, being referred by their community oncologist for super-specialty care, or requiring emergency care despite receiving medical care elsewhere, to name a few. Clinical data fragmentation is a reality of real-world data28. For our datasets, the California Department of Public Health had recently begun providing a monthly non-comprehensive death file to healthcare organizations for the purpose of adjudicating EHR record vital status, but this was not yet incorporated into our EHR at the time of our data request. Some of the missing values were attributed to the fact that a number of lab results and death information were in unstructured format in the clinical notes. We did not specifically ask for vital status in our data request; instead, we used the “most recent” date as the date of last encounter. This study depended upon the availability of data from the clinical data repositories of sufficient quality and completeness to conduct the proposed analyses. Although not fully assured of data quality until obtaining it, we had some measure of confidence because pSCANNER had collected OMOP-standard data to deliver results for queries from the Patient-Centered Outcomes Research Institute as part of the national PCORnet project for the past three years. The UC-wide data warehouse also standardizes on the OMOP data model.
A significant finding of this study is understanding why there is a discrepancy between the data outputs of two repositories at the same clinical site. The repositories have different underlying relational data models despite presumably containing the same data, and there may be systematic process issues that can serve as learning opportunities. The data was retrieved from the site’s two local data repositories, one of which was retired in April 2019. The query logic was not developed by the same person who developed the data request. To retrieve data from the two repositories, the analyst converted the query from T-SQL to Oracle and populated all metadata tables as filters. Validation and comparison of our patient numbers was done through another database, which generates the patient list from the physician’s panel in Clarity (the EHR reporting database). Multiple sources of data can give different results depending on how, what, when, and where the data are retrieved. A clinical repository is at least one step away from the actual real-world data that was originally stored at the point of care or as part of a healthcare transaction. What happens with the data in subsequent exports from the transactional database to a reporting database (e.g., Clarity) and beyond is often a source of introduced errors of omission and of semantic differences from mappings of EHR data to standardized codes. The lack of data from 2009-2014 in DR1 and DR2 showed that much of the data could not be analyzed due to missingness.
Strengths and Limitations
Although data quality analysis is not new, this study is the first to systematically assess and report the quality of adult ALL patient data retrieved from two repositories within one clinical data research network prior to conducting a time-to-event analysis. The authors operationalized the Weiskopf et al. 3x3 Data Quality Assessment and documented the process, bringing more clarity to its execution in this use case and informing future work. In the field of informatics, it takes a team to work with data, and collaboration is especially important when it comes to understanding information. Each discipline brings a viewpoint that may help the team see data differently. Our team consisted of a graduate student, data scientist, epidemiologist, hematologist, and computer scientist, each giving input on how to view the data through their own lens.
There were some limitations to our study. First, the small sample size made the work less generalizable; however, this was a pilot study to operationalize data quality guidelines. Although the input query was the same, the datasets retrieved from the data site varied, which made comparing and validating the data challenging. We also did not look at the temporality of the data, an important factor in clinical informatics; we plan to investigate this in a future study. An interesting future exploration could also consider why datasets built on database standards meant to harmonize medical data produce varying results. As this was the first attempt to retrieve high-volume, high-variety, and high-velocity cancer data across multiple departments within a clinical site, the researchers learned a great deal about developing a data request, creating a query, and investigating the barriers to high-quality clinical data for analysis. Future work in this area would be to refine the query and explore the fitness of the data with more data repositories.
Conclusion
The validation of data repositories does not always happen when they are automatically created from the EHR through a process of exporting and mapping that may introduce errors. It is critical to propose a standardized set of query pseudo-logic to test specific characteristics of a clinical data warehouse, showing an advantage of using a standardized, widely used common data model. This study explored the quality of data generated from two clinical data repositories prior to determining survival and complications of patients with cancer. Conducting a data quality analysis to understand the usefulness of EHR data prior to conducting a survival analysis for adult ALL is important; it shows that we should review how variability in data can affect results when utilizing the EHR as a secondary dataset for research. The completeness, correctness, and timeliness of the data cannot be assumed. Before using a dataset to examine survival or complications, researchers should first comprehensively assess it from a ‘fit-for-purpose’ perspective.
Acknowledgements
This work was funded by PCORI Contract CDRN-1306-04819. For championing our study, we thank the pSCANNER PIs: Drs. Lucila Ohno-Machado, Kai Zheng, Lisa Schilling, Matthew Matheny, Spencer Soohoo, Mary Whooley, Michael Aratow, Daniella Meeker, Arash Naeim, and Douglas Bell. We thank the pSCANNER project managers for logistical problem-solving: Kate Marie, Calvin Chang, Dr. Michele Day, Robynn Zenker, Jessica Bondy, Freneka Minter, Brian Tep, Nirupama Krishnamurthi, David Anderson, Tara Knight, Antonia Petruse, and Marianne Zachariah. We thank Paulina Paul for query script creation; Zhen Wu for local data validation; Kai Post for query distribution; Steven Zeck for sFTP coordination; Chris Lambertus and Dr. Nicholas Anderson for UCD’s analysis space; Liyang Zhong, Qiaohong Hu, Ray Pablo, and Nelson Lee for usable dataset delivery; Thomas Salazar for IRB assistance. We thank Drs. Sheryl L. Catz and Elizabeth I. Rice for thoughtful feedback on drafts of the manuscript.
References
1. Oriol A, Vives S, Hernandez-Rivas JM, et al. Outcome after relapse of acute lymphoblastic leukemia in adult patients included in four consecutive risk-adapted trials by the PETHEMA Study Group. Haematologica. 2010;95:589–96. doi:10.3324/haematol.2009.014274.
2. Seiter K. Acute Lymphoblastic Leukemia (ALL). Medscape Reference. 2018.
3. Jabbour E, O’Brien S, Konopleva M, Kantarjian H. New insights into the pathophysiology and therapy of adult acute lymphoblastic leukemia. Cancer. 2015;121:2517–28. doi:10.1002/cncr.29383.
4. Princic N, Song X, Lin V, Cong Z. Healthcare costs associated with adult Philadelphia chromosome-negative (Ph-) acute lymphoblastic leukemia (ALL) in the US. Blood. 2016;128:5940.
5. Daniel G, Silcox C, Bryan J, McClellan M, Romine M, Frank K. Characterizing RWD quality and relevancy for regulatory purposes. 2018.
6. Rothwell PM. External validity of randomised controlled trials: “to whom do the results of this trial apply?”. The Lancet. 2005;365:82–93. doi:10.1016/S0140-6736(04)17670-8.
7. Shang N, Weng C, Hripcsak G. A conceptual framework for evaluating data suitability for observational studies. J Am Med Inform Assoc. 2017.
8. Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems. 1996;12:5–33.
9. Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50 Suppl:S21–9. doi:10.1097/MLR.0b013e318257dd67.
10. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20:144–51. doi:10.1136/amiajnl-2011-000681.
11. Chan KS, Fowles JB, Weiner JP. Review: electronic health records and the reliability and validity of quality measures: a review of the literature. Med Care Res Rev. 2010;67:503–27. doi:10.1177/1077558709359007.
12. Hogan WR, Wagner MM. Accuracy of data in computer-based patient records. J Am Med Inform Assoc. 1997;4:342–55. doi:10.1136/jamia.1997.0040342.
13. Thiru K, Hassey A, Sullivan F. Systematic review of scope and quality of electronic patient record data in primary care. BMJ. 2003;326:1070. doi:10.1136/bmj.326.7398.1070.
14. Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform. 2013;46:830–6. doi:10.1016/j.jbi.2013.06.010.
15. Juran J, Godfrey AB. Quality Handbook. McGraw-Hill; 1999:173–8.
16. Weiskopf NG, Bakken S, Hripcsak G, Weng C. A data quality assessment guideline for electronic health record data reuse. eGEMs. 2017;5. doi:10.5334/egems.218.
17. Kahn MG, Callahan TJ, Barnard J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4:1244. doi:10.13063/2327-9214.1244.
18. Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Communications of the ACM. 1996;39:86–95.
19. Wu S. A review on coarse warranty data and analysis. Reliability Engineering & System Safety. 2013;114:1–11.
20. Lee K, Weiskopf N, Pathak J. A framework for data quality assessment in clinical research datasets. AMIA Annu Symp Proc. 2017;2017:1080–9.
21. Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. Application of an ontology for characterizing data quality for a secondary use of EHR data. Appl Clin Inform. 2016;7:69–88. doi:10.4338/ACI-2015-08-RA-0107.
22. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered Scalable National Network for Effectiveness Research. J Am Med Inform Assoc. 2014;21:621–6. doi:10.1136/amiajnl-2014-002751.
23. Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc. 2012;19:181–5. doi:10.1136/amiajnl-2011-000492.
24. Master S, Koshy N, Mansour R, Shi R. Effect of stem cell transplant on survival in adult patients with acute lymphoblastic leukemia: NCDB analysis. Anticancer Research. 2019;39:1899–906. doi:10.21873/anticanres.13298.
25. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Medical Care. 1998:8–27. doi:10.1097/00005650-199801000-00004.
26. Wermann WK, Viardot A, Kayser S, et al. Comorbidities are frequent in older patients with de novo acute lymphoblastic leukemia (ALL) and correlate with induction mortality: analysis of more than 1200 patients from GMALL data bases. Am Soc Hematology. 2018.
27. Terwilliger T, Abdul-Hay M. Acute lymphoblastic leukemia: a comprehensive review and 2017 update. Blood Cancer J. 2017;7:e577. doi:10.1038/bcj.2017.53.
28. Wei WQ, Leibson CL, Ransom JE, et al. Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus. J Am Med Inform Assoc. 2012;19:219–24. doi:10.1136/amiajnl-2011-000597.

