Abstract
Objective
Development of systematic approaches for understanding and assessing data quality is becoming increasingly important as the volume and utilization of health data steadily increase. In this study, a taxonomy of data defects was developed and used to automatically detect defects, thereby assessing the quality of the Medicaid data maintained by one of the states in the United States.
Materials and Methods
The Medicaid data examined contained more than 2.29 million rows and 32 million cells. The taxonomy was developed through document review, descriptive data analysis, and literature review. A software program was created to automatically detect defects by using a set of constraints whose development was facilitated by the taxonomy.
Results
Five major categories and 17 subcategories of defects were identified. The major categories are missingness, incorrectness, syntax violation, semantic violation, and duplication. More than 3 million defects were detected, indicating substantial problems with data quality. Defect density exceeded 10% in 5 tables. The majority of the data defects belonged to the format mismatch, invalid code, dependency-contract violation, and implausible value subcategories. Such contextual knowledge can support prioritized quality improvement initiatives for the Medicaid data studied.
Conclusions
This research took initial steps toward understanding the types of data defects and detecting defects in large healthcare datasets. The results generally suggest that healthcare organizations can potentially benefit from focusing on data quality improvement. For those purposes, the taxonomy developed and the approach followed in this study can be adopted.
Keywords: data quality, data defect, defect taxonomy, healthcare administration, Medicaid management information system
INTRODUCTION
Background and significance
Collecting, maintaining, and leveraging data to support decision making and daily operations are important targets in healthcare organizations.1–4 Consistent with Moore's and Kryder's laws of exponential increases in computational power and information storage, healthcare data have grown rapidly.5 In addition, with better methods of extracting information, translating information into knowledge, and deriving appropriate actions, the value of healthcare data as well as the number of data users have increased and are expected to grow even more rapidly in the near future.6 Accompanying these trends, data quality problems are being uncovered at an increasing rate,7–10 presenting challenges to healthcare organizations in leveraging their data assets.11–14
In this context, data quality is generally defined through fitness for use (ie, serving the needs of users pursuing certain goals).15 Incorrect, inconsistent, or missing data are examples of data quality problems. According to a study from Oracle, healthcare providers lose an average of $70.2 million annually, or 15% of potential revenue per hospital, because the poor quality of the large volumes of data they collect leaves them unable to interpret and translate the data into actionable insight.16 Generally speaking, it can be argued that poor data quality detracts from the quality, effectiveness, and efficiency of healthcare services by leading to imprecise, useless, or even misleading results and suboptimal decision making.17–19
The reasons behind the lack of data quality are often multifaceted and challenging. Various information technology (IT) software development and adoption problems, such as software design flaws (eg, no input validation in user interfaces), lack of documentation,20,21 lack of user training,22,23 or delays in system updates,24 can negatively impact data quality. However, the essential problems are arguably associated with the basic laws of software evolution:25–28 software systems actively used in the real world face constant pressure from the environment to accommodate changing and new requirements such as new healthcare workflows, policies, regulations, and laws. Meeting those requirements during the evolution of a software system becomes increasingly difficult and costly due to typically increasing software size and complexity.29 Software improvements, upgrades, and fixes, often performed with limited budgets, might easily overlook ensuring that the software (eg, its user interface) is of high quality, documentation is updated, sufficient training is provided, or the system receives and validates data correctly. In addition, it is important to guarantee that various data import, export, migration, and transportation operations avoid incorrectly modifying data. These challenges can be exacerbated by the separation of data creators and data users. In this quite common scenario, while data users experience the data quality problems and suffer the consequences,30 those who create the data might lack the same concerns, interest, and motivation to address the problems.
Therefore, continuous monitoring of data quality becomes a critical first step toward providing useful feedback into organizational IT software adoption and maintenance processes. While a few prior studies focused on dirty data inspection using ad hoc methods,31–33 there remains a general need for systematic approaches for understanding and assessing data quality; consequently, the current initiatives in healthcare organizations are often carried out in an ad hoc manner. To advance the state of the art, contributing to the knowledge about the types of data quality problems is essential. Such knowledge can facilitate communication within the organization while detecting and resolving data defects. In addition, obtaining evidence about the prevalence of problems is important for raising knowledge and awareness about data quality, which, in turn, can facilitate initiatives aiming to improve data quality in health organizations.
Objectives
This research (1) developed a comprehensive taxonomy for data defects (henceforth referred to as defects) and (2) used the taxonomy to assess the prevalence of defects in the real-life health datasets used for healthcare administration purposes by the Department of Health in 1 of the states in the United States. A defect refers to a deviation from an expectation placed on data for achieving or maintaining fitness for use. For a set of stated expectations, higher numbers of defects are associated with lower data quality, and vice versa.
MATERIALS AND METHODS
Data
The department expressed a need to understand the quality of the data residing in the Medicaid Management Information System (MMIS), mainly in its Procedure and Provider subsystems, which include data about the Medicaid procedures and providers, respectively. Since the adoption of MMIS over the last 3 decades, the data in these subsystems were mostly entered by the end users, who were Department employees, through the user interfaces made available to them. The end users have received data from a variety of sources, including paper forms, other individuals, and other systems.
Table 1 shows basic information about the datasets, which were organized as tables. Accordingly, this article adopts the traditional language associated with the standard tabular organization of datasets, using commonly understood terms such as table, row (or record), column, cell (at the intersection of a column and a row), and value (the datum in a cell). As seen in Table 1, the MMIS data examined in this study included more than 32 million cells.
Table 1.
Medicaid datasets
| Subsystem | Table name | Columns | Rows | Cells |
| --- | --- | --- | --- | --- |
| Procedure | Claim Type | 3 | 242 | 726 |
| Procedure | Coverage Group | 3 | 67 465 | 202 395 |
| Procedure | Master | 55 | 35 076 | 1 929 180 |
| Procedure | Modifier | 3 | 184 378 | 553 134 |
| Procedure | Place of Service | 3 | 38 564 | 115 692 |
| Procedure | Price | 8 | 371 177 | 2 969 416 |
| Procedure | Provider Type | 3 | 39 093 | 117 279 |
| Procedure | Specialty | 3 | 2 052 | 6 156 |
| Provider | Address | 9 | 241 875 | 2 176 875 |
| Provider | Category of Service | 12 | 375 541 | 4 506 492 |
| Provider | Enrollment Period | 4 | 342 877 | 1 371 508 |
| Provider | Group | 6 | 113 723 | 682 338 |
| Provider | Lab Classification | 5 | 18 408 | 92 040 |
| Provider | Master | 97 | 165 036 | 16 008 492 |
| Provider | Receiver | 7 | 26 812 | 187 684 |
| Provider | Specialty | 6 | 189 100 | 1 134 600 |
| Provider | Supplement | 4 | 79 151 | 316 604 |
| Total | | 231 | 2 290 570 | 32 370 611 |
MMIS is a legacy system adopted in the 1990s that saw major revisions in the 2000s. Over the last 3 decades, its user interfaces were modified a number of times to accommodate changing Medicaid policies. Owing to the centrality of the functions it performs and the data it manages, MMIS continues to play a critical role in the state's health administration.
The selection and examination of Medicaid data in one of the states provide a realistic context and bring evidence to the issue of data quality, which has not been systematically explored, understood, or addressed in most health organizations. Medicaid data hold tremendous potential to support the management and delivery of health services for socioeconomically disadvantaged and underserved populations.34–36 However, deficiencies in the Medicaid data often reduce their usefulness for improving operations and decision making.37–39 Therefore, working on Medicaid data serves a useful long-term purpose in addition to providing a real context for the study.
Taxonomy development
In many scientific areas, such as software systems40 and economics,41 taxonomies have been identified as cognitive and management mechanisms for comprehending and organizing both newly emerging and prior knowledge. Within health care, taxonomies have been used in various studies, such as those about grading patient-centered evidence,42 evaluating health technologies,43 and adopting electronic health record systems.44 The advantages of a taxonomy are often mentioned in the use of qualitative methods in health services research.45,46 To contribute to reducing medical errors and improving patient safety, a cognitive taxonomy was developed to categorize and explain medical errors.47 Pursuing the first objective with similar motivations resulted in a comprehensive taxonomy comprising defect categories and subcategories that are both stand-alone and interrelated. Taxonomy development involved 3 steps:
Document analysis: All of the available MMIS documents were examined to start learning about the expected data formats and values. This process enabled the identification of the first sets of defects and defect categories via visual examination of data. The documents included the user guide, value description files, and matrix files: the user guide provided information about the field descriptions in the 3 and 7 display screens of the Procedure and Provider subsystems, respectively; the value description files provided the full names of the variables in the datasets and their valid values; and the matrix files defined dependencies among certain data items.
Descriptive analysis: This step was performed to detect extreme or abnormal values in an effective and efficient manner. As not every such condition is necessarily a defect, further inquiries with the state officials were conducted as necessary. For analysis, the data tables were imported into the statistical environment R (version 3.3.3; R Foundation for Statistical Computing, Vienna, Austria), paying close attention to the data formats. After importing the data, descriptive summary results were generated for each table, including the number of rows, number of missing values, number of unique values, lowest values, highest values, means, medians, and percentiles (a minimal sketch of such a per-column summary appears after this list). In light of the information gathered from the document analysis, each variable in the dataset was examined to identify potential violations by recognizing missing values, unexpected symbols such as a comma or a period in a name field, and abnormal values such as "01/01/1901" or "12/31/9999" in date fields. Consequently, this examination of the Medicaid data supported the development and refinement of the defect taxonomy.
Literature review: To achieve a more comprehensive categorization of defects and consistency with prior studies, a literature review was conducted by searching for relevant keywords such as data quality, data cleansing, dirty data, data defect, and data repair on Google Scholar and PubMed. Based on relevance, 15 articles were selected from 160 initially identified articles after reviewing their titles and abstracts. As a result, a reconciled, refined, and finalized defect taxonomy with the major categories and subcategories emerged.
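As referenced in the descriptive analysis step above, the following is a minimal sketch of the kind of per-column summary generated. The actual analysis was performed in R; for consistency with the Tcl detection program described later, the sketch re-expresses the same counts in Tcl, and the sample values are hypothetical:

```tcl
# Hypothetical sketch of a per-column descriptive summary (the study used R;
# this Tcl version mirrors the same counts for illustration only).
proc summarizeColumn {values} {
    set missing 0
    set present {}
    foreach v $values {
        if {$v eq ""} { incr missing } else { lappend present $v }
    }
    set sorted [lsort $present]
    return [dict create \
        rows    [llength $values] \
        missing $missing \
        unique  [llength [lsort -unique $present]] \
        lowest  [lindex $sorted 0] \
        highest [lindex $sorted end]]
}

# Sentinel dates such as 9999-12-31 surface immediately as extreme values.
puts [summarizeColumn {1995-04-01 2001-11-30 "" 9999-12-31 1901-01-01}]
```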
Defect prevalence
A software program was developed to detect defects by automatically identifying any violations of a set of stated constraints in the datasets. A constraint clearly specifies an expectation stated for data. An example constraint for Medicare beneficiaries could be that any value appearing in a cell under the age column must be 65 or above. A defect represents the violation of a constraint within a cell, and a cell containing a defect is called a defective cell. A defective cell can be associated with multiple defects because there can be multiple violations for that cell. In the previous example, an additional constraint allowing no missing value for age would make it possible to detect 0, 1, or 2 defects for any given cell under the age column.
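To make the counting concrete, here is a minimal Tcl sketch of the 2 constraints in the age example. The exact checks are assumptions for illustration; in particular, treating a missing value as also violating the range constraint is what makes 2 defects in a single cell possible:

```tcl
# Hypothetical sketch: 2 column-level constraints for the Medicare age example.
proc countDefects {cell} {
    set defects 0
    # Constraint 1: the cell must not be empty.
    if {$cell eq ""} { incr defects }
    # Constraint 2: the value must be an integer of 65 or above; an empty
    # cell also fails this check, so a missing age yields 2 defects.
    if {!([string is integer -strict $cell] && $cell >= 65)} { incr defects }
    return $defects
}

foreach cell {72 64 ""} {
    puts "age cell '$cell' -> [countDefects $cell] defect(s)"   ;# 0, 1, 2
}
```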
By definition, the existence of defects depends on the existence and statement of constraints that specify known expectations placed on the data. To the extent that the constraints are known and complete, the defects can be detected accurately. Therefore, guided by the taxonomy, a meaningful list of column-level constraints (ie, constraints applying to all cells in the specified columns) was created. Writing constraints for individual cells is also possible; however, it requires detailed specification of expectations at the individual cell level, which was neither feasible nor needed in this study.
In writing the constraints, the available MMIS documentation served as a source of reference. In addition, the researchers worked closely with a data steward who was highly familiar with the datasets and associated data quality issues, as well as with how the interaction of various stakeholders (eg, providers and end users) with MMIS affects data quality. As needed, the data steward searched for, located, and provided additional documents, such as certain value description files. The data steward frequently answered questions over email and phone during the creation of constraints and provided useful feedback and ideas. The constraints and results were shared with the data steward and other related department colleagues in a technical report and in presentations given during multiple in-person and online meetings over the course of the study. The researchers incorporated the feedback to refine the constraints throughout the study.
By applying constraints to the data, the counts of defects and defective cells were obtained. To normalize for data volume, defect density, which is the number of defects in a table divided by the number of cells in that table, and defective cell density, which is the number of defective cells in a table divided by the number of cells in that table, were also calculated. The 95% confidence intervals48 were also calculated for the density measures.
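For example, the densities and intervals reported in Table 3 can be reproduced as below. The normal-approximation (Wald) interval for a proportion is an assumption here, since the study cites the confidence interval methodology48 without spelling out the formula:

```tcl
# Hypothetical sketch: defect density with a 95% confidence interval, assuming
# the normal-approximation (Wald) interval for a proportion.
proc densityWithCI {defects cells} {
    set p    [expr {double($defects) / $cells}]
    set half [expr {1.96 * sqrt($p * (1.0 - $p) / $cells)}]
    return [format "%.2f%% ± %.2f%%" [expr {$p * 100}] [expr {$half * 100}]]
}

# The Claim Type row of Table 3: 61 defects among 726 cells.
puts [densityWithCI 61 726]   ;# prints 8.40% ± 2.02%
```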
The program was mainly developed using the Tool Command Language (Tcl).49 The use of Tcl as a high-level, interpreted scripting language facilitated50 program development. The program stores the data in an SQLite database.51 SQLite, as a serverless and standalone database, provided fast and reliable operations while facilitating the ease of development.52 The constraints were coded into the program as either logical or regular expressions by using the Structured Query Language (SQL). A multithreaded programming approach was adopted to improve performance.
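The following minimal sketch illustrates this design. The table, columns, sample rows, and constraint predicates are hypothetical stand-ins, and the single-threaded sketch omits the multithreading used in the actual program:

```tcl
# Hypothetical sketch of the detection approach: constraints expressed as SQL
# predicates (logical or regular expressions) run against a SQLite table.
package require sqlite3

sqlite3 db :memory:
db eval {CREATE TABLE provider_master (provider_base_number TEXT, state TEXT)}
db eval {INSERT INTO provider_master VALUES ('1234567','MD'), ('12345','OO')}

# Expose Tcl's regexp command to SQL so predicates can use REGEXP.
db function regexp {regexp --}

# Each column-level constraint pairs a defect subcategory with a predicate
# that is true for violating cells.
set constraints {
    {format_mismatch "provider_base_number NOT REGEXP '^[0-9]{7}$'"}
    {invalid_code    "state NOT IN ('MD','VA','PA')"}
}
foreach c $constraints {
    lassign $c subcategory predicate
    set n [db onecolumn "SELECT count(*) FROM provider_master WHERE $predicate"]
    puts "$subcategory: $n defect(s)"
}
db close
```

Because the Tcl interface binds scripting and SQL in one process, each constraint remains a single readable predicate, keeping the mapping from taxonomy subcategory to executable check direct.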
RESULTS
Defect taxonomy
Figure 1 shows the taxonomy tree for the 5 main categories and 17 subcategories of defects, discussed next: missingness53,54,56–59 indicates the absence of a set of values expected to exist. In our datasets, for example, there are some columns whose cells must never be empty under any circumstance, such as the provider base number and provider number, which are primary keys in the dataset, as well as other columns such as procedure name. The situation in which some value is absolutely required but missing is referred to as required-value missing.53,54,56–59 The conditionally-required-value missing subcategory refers to the absence of a value whose presence may or may not be required depending on certain values appearing in other cells. For example, the "Health Maintenance Organization" type must be filled in when the provider type is "HMO" and the provider location is "00" in the same record. A dummy entry58,59 is a value with no actual meaning (eg, the value "000000000" in the provider Social Security number field).
Figure 1.
Taxonomy tree for data defects.
Incorrectness54,56–59 means that a value is outside of the set of values known to be correct. Implausible value53,54,56,59 refers to a value outside of the range determined for correct values, such as "1901-01-01" in the provider service begin date, which is obviously improbable. A misspelling is a value with a spelling error or typo. A misfielded value54,57–60 is a value shifted to a wrong column by mistake (eg, via wrong input or a programming bug that affected the data). In addition, distortion53–55 refers to incorrect values introduced by user or programmer errors for various reasons, such as calculation or data entry mistakes.
Syntax violation53,54,56 refers to a deviation from the required syntax. When a column can contain only a list of valid codes as values, a code not included in this list is an invalid code.19,59–61 For example, "OO" is an invalid state code for a provider. Type mismatch53,54,57,58,62 refers to a situation in which the value does not fulfill the requirement stated for the data type (eg, "12," a numeric value, appearing in the provider state column). Format mismatch53,54,57,62 occurs when a value violates a column's constraints on the number of digits or the specific combination of alphabetic and numeric characters. For example, the provider base number must be 7 digits; therefore, a 5-digit number found in the dataset under the corresponding column represents a defect.
Semantic violation relates to inconsistencies of information within and across columns. The dependency-contract violation subcategory53–56,58,59,62,63 indicates that a value is not in the value range or set semantically determined by other columns. For instance, each provider specialty code determines a group of valid values for provider type; therefore, provider type cannot take a value outside of those determined by the provider specialty code in the same record. Another example is that the provider service start date cannot be later than the service end date. In these 2 examples, 2 columns have semantic relationships creating certain constraints that should not be violated. Computational error53 takes place when a value does not follow the computational relationships that need to be preserved with other columns (eg, values in one column always being a proportion of the values in another column in the same row). Misleading abbreviation53,57–60 refers to an abbreviation that can be interpreted in multiple ways, such as "Dr," which could be read as either "doctor" or "drive," and "MD," which can be "Maryland" or "Medical Doctor." Different unit54,58–60,64 indicates a numeric value in a measurement column expressed in a measurement unit different from the one mainly used for the column (eg, use of pound instead of kilogram).
Duplication53,54,57,59,60,62 indicates a violation of a stated requirement of avoiding the same or near-same (approximate) values. The duplication-across-entities subcategory refers to 2 or more entities with the same primary key(s), which are supposed to be unique in the dataset.53–55,57,59,60,62 Duplication across features means the same value appearing in multiple features of the same attribute. For example, in the provider category of service table, there are 8 columns for the provider category of service code because each provider can provide at least 1 and at most 8 categories of service; thus, values in these 8 columns must be unique across the 8 features. Also, for a particular pair of records, approximate values that indicate the same record are referred to as synonyms.59–61 Approximate values may differ by a few missing or mismatched characters or by letter case.
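To connect the taxonomy to the detection program described in the Materials and Methods section, the sketch below lists 1 illustrative violation predicate per major category, written in the same SQL-predicate style; all table and column names, date ranges, and code lists are hypothetical:

```tcl
# Hypothetical violation predicates, 1 per major taxonomy category, in the
# style of the detection program; names and value lists are illustrative.
set exampleConstraints {
    {missingness        "procedure_name IS NULL OR procedure_name = ''"}
    {incorrectness      "provider_service_begin_date IN ('1901-01-01','9999-12-31')"}
    {syntax_violation   "provider_base_number NOT REGEXP '^[0-9]{7}$'"}
    {semantic_violation "provider_service_begin_date > provider_service_end_date"}
    {duplication        "cos_code_1 = cos_code_2"}
}
```

Placed in a WHERE clause as in the earlier sketch, each predicate selects the rows whose cells violate the corresponding constraint; the cos_code example checks only 1 of the pairwise comparisons needed across the 8 category-of-service columns.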
Defect prevalence
Table 2 shows the number of constraints associated with different defect types. Table 3 shows the number of defects detected by applying the set of constraints created. Overall, 3 151 743 defects were detected in 2 825 784 defective cells among the 32 370 611 cells, with some cells having multiple defects. On average, for every 100 cells, there were 9.74 defects located in 8.73 defective cells. Although less than the number of defects, the number of defective cells is considerably high. The defect count in a given cell varies from 1 to 21: 2 660 943 cells (94.17% of defective cells) include 1 defect each, 164 415 (5.82%) include 2-10 defects each, and 426 (<0.02%) include 11-21 defects each.
Table 2.
Distribution of constraints
| Subsystem | Table Name | Missingness: Required Missing | Missingness: Conditionally Missing | Incorrectness: Implausible Value | Syntax: Invalid Code | Syntax: Format Mismatch | Semantic: Dependency Contract | Duplication: DAE | Duplication: DAF | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Procedure | Claim Type | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 5 |
| Procedure | Coverage Group | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 5 |
| Procedure | Master | 5 | 5 | 3 | 30 | 1 | 3 | 1 | 0 | 48 |
| Procedure | Modifier | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 5 |
| Procedure | Place of Service | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 5 |
| Procedure | Price | 4 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 9 |
| Procedure | Provider Type | 2 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 7 |
| Procedure | Specialty Code | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 6 |
| Provider | Address | 6 | 0 | 2 | 3 | 2 | 4 | 1 | 0 | 18 |
| Provider | Category of Service | 4 | 0 | 2 | 9 | 1 | 9 | 1 | 1 | 27 |
| Provider | Enrollment Period | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 8 |
| Provider | Group | 4 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 10 |
| Provider | Lab Classification | 4 | 0 | 2 | 2 | 1 | 0 | 1 | 0 | 10 |
| Provider | Master | 17 | 2 | 16 | 24 | 7 | 21 | 1 | 1 | 89 |
| Provider | Receiver | 1 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 5 |
| Provider | Specialty | 6 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 12 |
| Provider | Supplement | 2 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 7 |
| Total | | 67 | 13 | 35 | 87 | 17 | 45 | 10 | 2 | 276 |
DAE: duplication across entities; DAF: duplication across features.
Table 3.
Data defect counts and densities
| Subsystem | Table Name | Required Missing | Conditionally Missing | Implausible Value | Invalid Code | Format Mismatch | Dependency Contract | DAE | DAF | Defects | Defect Density (%) | Defective Cells | Defective Cell Density (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Procedure | Claim Type | 0 | 0 | NA | 61 | NA | NA | NA | NA | 61 | 8.40 ± 2.02 | 61 | 8.40 ± 2.02 |
| Procedure | Coverage Group | 0 | 0 | NA | 911 | NA | NA | NA | NA | 911 | 0.45 ± 0.03 | 911 | 0.45 ± 0.03 |
| Procedure | Master | 2 | 865 | 2 | 104 501 | 0 | 1480 | 766 | NA | 107 616 | 5.58 ± 0.03 | 106 857 | 5.54 ± 0.03 |
| Procedure | Modifier | 0 | 0 | NA | 87 818 | NA | NA | NA | NA | 87 818 | 15.88 ± 0.10 | 87 818 | 15.88 ± 0.10 |
| Procedure | Place of Service | 0 | 0 | NA | 2153 | NA | NA | NA | NA | 2153 | 1.86 ± 0.08 | 2153 | 1.86 ± 0.08 |
| Procedure | Price | 2 | 100 811 | 1 | 25 453 | NA | 436 | NA | NA | 126 703 | 4.27 ± 0.02 | 126 702 | 4.27 ± 0.02 |
| Procedure | Provider Type | 0 | NA | NA | 3983 | NA | 2027 | NA | NA | 6010 | 5.12 ± 0.13 | 5464 | 4.66 ± 0.12 |
| Procedure | Specialty Code | 0 | 0 | NA | 324 | NA | 651 | NA | NA | 975 | 15.84 ± 0.91 | 858 | 13.94 ± 0.87 |
| Provider | Address | 10 | NA | 0 | 21 338 | 95 516 | 105 124 | 0 | NA | 221 988 | 10.20 ± 0.04 | 173 765 | 7.98 ± 0.04 |
| Provider | Category of Service | 9 | NA | 8 | 50 535 | 75 095 | 38 502 | 0 | 0 | 164 149 | 3.64 ± 0.02 | 158 438 | 3.52 ± 0.02 |
| Provider | Enrollment Period | 0 | NA | 0 | 19 857 | 66 837 | 22 | 0 | NA | 86 716 | 6.32 ± 0.04 | 86 716 | 6.32 ± 0.04 |
| Provider | Group | 0 | NA | 1 | 517 | 24 275 | 52 | 0 | NA | 24 845 | 3.64 ± 0.04 | 24 845 | 3.64 ± 0.04 |
| Provider | Lab Classification | 0 | NA | 5 | 3631 | 1525 | NA | 0 | NA | 5161 | 5.61 ± 0.15 | 5161 | 5.61 ± 0.15 |
| Provider | Master | 71 445 | 0 | 402 364 | 571 667 | 641 527 | 446 140 | 0 | 60 | 2 133 203 | 13.33 ± 0.02 | 1 870 457 | 11.68 ± 0.02 |
| Provider | Receiver | 0 | NA | 0 | 0 | 6639 | NA | 0 | NA | 6639 | 3.54 ± 0.08 | 6639 | 3.54 ± 0.08 |
| Provider | Specialty | 0 | NA | 76 | 33 802 | 36 084 | 12 542 | 0 | NA | 152 360 | 13.43 ± 0.06 | 144 504 | 12.74 ± 0.06 |
| Provider | Supplement | 0 | NA | 0 | 3791 | 20 644 | NA | 0 | NA | 24 435 | 7.72 ± 0.09 | 24 435 | 7.72 ± 0.09 |
| Total | | 71 468 | 101 676 | 402 457 | 930 342 | 968 142 | 606 976 | 766 | 60 | 3 151 743 | 9.74 ± 0.01 | 2 825 784 | 8.73 ± 0.01 |
Density values are percentages ± 95% confidence intervals, unless otherwise indicated.
DAE: duplication across entities; DAF: duplication across features; NA: Not Applicable (due to no constraint defined).
DISCUSSION
The results revealed important data quality problems. Figure 2 depicts the defect counts and densities for the tables in the Procedure and Provider subsystems, while Figure 3 depicts the counts and densities of defective cells. Considering either measure, the results are largely consistent regarding the most defect-prone tables: the Modifier and Specialty Code tables in the Procedure subsystem and the Specialty, Master, and Address tables in the Provider subsystem had the highest counts and densities of defects and defective cells. In fact, the defect densities in these tables exceed 10%, which is arguably high for health datasets expected to support operations and decision making. Such observations can be useful to inform prioritized initiatives planned for data quality improvement.
Figure 2.
Defect count and density for the tables in the (A) Procedure and (B) Provider subsystems with the 95% confidence intervals.
Figure 3.
Number of defective cells and defective cell density (defective cells per cell) for the tables in the (A) Procedure and (B) Provider subsystems with the 95% confidence intervals.
In addition, certain types of defects appear to be more prevalent; therefore, prioritized initiatives focusing on detecting and fixing those categories of defects could potentially lead to a higher return on investment. As shown in Figure 4, the format mismatch, invalid code, dependency-contract violation, and implausible value categories contributed most to the lack of data quality, having the highest defect counts and defects per constraint.
Figure 4.
Number of defects and number of defects per constraint in each category and subcategory.
Format mismatch
More than 30% of the defects fall into the format mismatch category, and they were detected by only 17 constraints. Wrong-digit errors were detected in columns such as provider base number, telephone number, and Social Security number. For example, the provider base number is a 7-digit identification number for each provider; however, 36 970 records in the provider master table have provider base numbers that are not 7-digit numbers.
Invalid code
About 30% of the defects fall into the invalid code category, which is associated with the misuse of Medicaid codes and Medicaid indicators. Notably, all provider remittance media codes, all record codes, and more than 99.8% of Medicare part codes in the provider master table were wrong. There were 57 columns related to Medicaid codes and Medicaid indicators in the Procedure and Provider subsystems. Each column has a list of valid values, and the number of valid values per column varies from 2 to more than 100. The extensive use of Medicaid codes and indicators, and the similarities among them, can mislead users into entering wrong values by mistake. As the input validation features in MMIS have been mostly missing or ineffective, there is a high possibility that users provided invalid codes as input.
Dependency-contract violation
Next, almost 20% of the defects are dependency-contract violations, which probably occurred due to mismatches between Medicaid codes. For example, each provider type code is associated with a group of valid provider category of service codes. In this case, the provider type and provider category of service codes not only need to follow their syntax constraints but also need to match each other semantically. Usually, a value with a syntax violation would also violate a dependency-contract constraint if it depends on another value.
Implausible value
About 13% of the defects are implausible values. The most frequent case involved dates meant to indicate "no start date" or "no end date": many such dates were entered as "01/01/1901" or "12/31/9999," which are not acceptable values (eg, 166 171 provider license withdraw dates and 166 035 federal first withheld dates in the provider master table).
The likely impact of the high percentage of detected defects is that various ad hoc and systematic defect detection, correction, and prevention activities will need to take place in an organization. As done by our department colleagues, adopting ad hoc methods can be successful for detecting and correcting systemic defects (eg, defects caused by software bugs), because such defects present certain patterns. However, detecting nonsystemic defects (eg, those resulting from invalid or wrong data entry) requires implementing a systematic approach that monitors data quality and informs IT adoption processes such as end user training and organizational workflows. While doing so, certain uses of data that still lead to acceptable results can be documented and shared within the organization. For example, it may be determined that it is still acceptable to analyze certain large datasets with missing values because the missingness is random. As another example, incorrect values might occur only above a value threshold, say, for elderly beneficiaries; therefore, analyzing the remaining portion of the data for an investigation solely focusing on younger beneficiaries can be acceptable.
Limitations
The context of this study is largely defined by a government healthcare administration agency with strong reliance on large-scale legacy software systems. The datasets examined belong to MMIS, and they can potentially be useful for supporting operations and decision making in the area of healthcare administration. Caution should be used in generalizing the results to other settings because the nature of data and the expectations placed on data regarding its fitness for use can differ. Yet, the approach and the lessons learned can be relevant to various health organizations, such as hospitals and healthcare systems adopting multiple software systems (eg, electronic health records), which keep evolving over the years and constantly increase in size and complexity. In this study, the documentation related to the data was quite limited; the set of constraints could possibly have been extended with additional documentation. Nevertheless, the constraints defined by using the available resources still helped detect a large number of defects, indicating the large scope of data quality problems in the examined datasets.
CONCLUSION
This research created a comprehensive taxonomy for data defects and detected more than 3 million data defects in the Medicaid datasets examined. By creating and using this taxonomy, the study contributes to an improved understanding and categorization of defects, which can facilitate the work of data stewards in health organizations toward improving data quality.
Overall, this study presents contextual evidence from a real-world setting showing that data quality can be a vital concern for today's health organizations that maintain data in large-scale software systems. To be better prepared, the data governance and management policies of healthcare organizations should include measures to periodically detect data defects as an important component of continuous data quality improvement processes.
FUNDING
This research was supported by the State Department of Health funding under DG_2016_0601_001 and DG_2016_0601_002.
AUTHOR CONTRIBUTIONS
GK conceptualized the study and received funding as the Principal Investigator. GK and YZ substantially participated in the study design, data collection and analysis, and writing and revising the manuscript.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
- 1. Rosenbaum S. Data governance and stewardship: designing data stewardship entities and advancing data access. Health Serv Res 2010; 45 (5p2): 1442–55.
- 2. Patel VL, Kushniruk AW, Yang S, Yale JF. Impact of a computer-based patient record system on data collection, knowledge organization, and reasoning. J Am Med Inform Assoc 2000; 7 (6): 569–85.
- 3. Dunkel B, Soparkar N. Data organization and access for efficient data mining. In: Proceedings of the International Conference on Data Engineering (Cat. No. 99CB36337); 1999: 522–9.
- 4. Schroeder AT Jr. Data mining with neural networks: solving business problems from application development to decision support. J Am Soc Inf Sci 1997; 48 (9): 862–3.
- 5. Dinov ID, Petrosyan P, Liu Z, et al. The perfect neuroimaging-genetics-computation storm: collision of petabytes of data, millions of hardware devices and thousands of software tools. Brain Imaging Behav 2014; 8 (2): 311–22.
- 6. Dinov ID. Volume and value of big healthcare data. J Med Stat Inform 2016; 4 (1): 3.
- 7. Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM. Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: a systematic approach to quality control of repositories. J Am Med Inform Assoc 2016; 23 (6): 1085–95.
- 8. Singer A, Yakubovich S, Kroeker AL, Dufault B, Duarte R, Katz A. Data quality of electronic medical records in Manitoba: do problem lists accurately reflect chronic disease billing diagnoses? J Am Med Inform Assoc 2016; 23 (6): 1107–12.
- 9. Lee SJC, Grobe JE, Tiro JA. Assessing race and ethnicity data quality across cancer registries and EMRs in two hospitals. J Am Med Inform Assoc 2016; 23 (3): 627–34.
- 10. Strong DM, Lee YW, Wang RY. Data quality in context. Commun ACM 1997; 40 (5): 103–10.
- 11. Corsi DJ, Perkins JM, Subramanian S. Child anthropometry data quality from Demographic and Health Surveys, Multiple Indicator Cluster Surveys, and National Nutrition Surveys in the West Central Africa region: are we comparing apples and oranges? Glob Health Action 2017; 10 (1): 1328185.
- 12. Price M, Davies I, Rusk R, Lesperance M, Weber J. Applying STOPP guidelines in primary care through electronic medical record decision support: randomized control trial highlighting the importance of data quality. JMIR Med Inform 2017; 5 (2): e15.
- 13. Brennan PF, Stead WW. Assessing data quality from concordance, through correctness and completeness, to valid manipulatable representations. J Am Med Inform Assoc 2000; 7 (1): 106–7.
- 14. Tickner N, Ockner M. Preventing Death and Injury from Medical Errors Requires Dramatic, Systemwide Changes. Press Release. Washington, DC: Institute of Medicine, Division of Health Care Services; 1999.
- 15. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 2013; 20 (1): 144–51.
- 16. Lewis N. Poor data management costs healthcare providers. Inf Week Healthc 2012. https://www.informationweek.com/healthcare/clinical-information-systems/poor-data-management-costs-healthcare-providers/d/d-id/1105481. Accessed August 22, 2019.
- 17. Christiansen-Lindquist L, Silver RM, Parker CB, et al. Fetal death certificate data quality: a tale of two US counties. Ann Epidemiol 2017; 27 (8): 466–71.
- 18. Lee CH, Yoon HJ. Medical big data: promise and challenges. Kidney Res Clin Pract 2017; 36 (1): 3–11.
- 19. Yakout M, Elmagarmid AK, Neville J, Ouzzani M, Ilyas IF. Guided data repair. Proc VLDB Endow 2011; 4 (5): 279–89.
- 20. Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma 2010; 2010: 1–5.
- 21. Fowles JB, Lawthers AG, Weiner JP, Garnick DW, Petrie DS, Palmer RH. Agreement between physicians' office records and Medicare part B claims data. Health Care Financ Rev 1995; 16 (4): 189–99.
- 22. Van Der Bij S, Khan N, Ten Veen P, De Bakker DH, Verheij RA. Improving the quality of EHR recording in primary care: a data quality feedback tool. J Am Med Inform Assoc 2017; 24 (1): 81–7.
- 23. Porcheret M, Hughes R, Evans D, et al. Data quality of general practice electronic health records: the impact of a program of assessments, feedback, and training. J Am Med Inform Assoc 2004; 11 (1): 78–86.
- 24. Ash JS, Berg M, Coiera E. Some unintended consequences of information technology in health care: the nature of patient care information system-related errors. J Am Med Inform Assoc 2003; 11 (2): 104–12.
- 25. Lehman MM. Programs, life cycles, and laws of software evolution. Proc IEEE 1980; 68 (9): 1060–76.
- 26. Lehman MM, Belady LA. Program Evolution: Processes of Software Change. San Diego, CA: Academic Press Professional, Inc; 1985.
- 27. Lehman MM, Ramil JF, Wernick PD, Perry DE, Turski WM. Metrics and laws of software evolution: the nineties view. In: Proceedings of the Fourth International Software Metrics Symposium; 1997: 20–32.
- 28. Drouin N, Badri M. Investigating the applicability of the laws of software evolution: a metrics based study. In: Filipe J, Maciaszek LA, eds. ENASE 2013: Evaluation of Novel Approaches to Software Engineering. New York, NY: Springer; 2013: 174–89.
- 29. Banker RD, Datar SM, Kemerer CF, Zweig D. Software complexity and maintenance costs. Commun ACM 1993; 36 (11): 81–94.
- 30. Leonard CE, Brensinger CM, Nam YH, et al. The quality of Medicaid and Medicare data obtained from CMS and its contractors: implications for pharmacoepidemiology. BMC Health Serv Res 2017; 17 (1): 304.
- 31. Rabia L, Amarouche IA, Bey KB. Rule-based approach for detecting dirty data in discharge summaries. In: Proceedings of the 2018 International Symposium on Programming and Systems (ISPS); 2018: 1–6.
- 32. Cao H, Ma R, Ren H, Ge SS. Data-defect inspection with kernel-neighbor-density-change outlier factor. IEEE Trans Automat Sci Eng 2018; 15 (1): 225–38.
- 33. Hudson CL, Topaloglu U, Bian J, Hogan W, Kieber-Emmons T. Automated tools for clinical research data quality control using NCI common data elements. AMIA Jt Summits Transl Sci Proc 2014; 2014: 60–9.
- 34. McManus BM, Rapport MJ, Richardson Z, Lindrooth R. Therapy use for children with developmental conditions: analysis of Colorado Medicaid data. Pediatr Phys Ther 2017; 29 (3): 192–8.
- 35. Palmsten K, Huybrechts KF, Kowal MK, Mogun H, Hernández-Díaz S. Validity of maternal and infant outcomes within nationwide Medicaid data. Pharmacoepidemiol Drug Saf 2014; 23 (6): 646–55.
- 36. Castillo VH, Martínez-García AI, Pulido J. A knowledge-based taxonomy of critical factors for adopting electronic health record systems by physicians: a systematic literature review. BMC Med Inform Decis Mak 2010; 10 (1): 60.
- 37. Hennessy S, Leonard CE, Palumbo CM, Newcomb C, Bilker WB. Quality of Medicaid and Medicare data obtained through Centers for Medicare and Medicaid Services (CMS). Med Care 2007; 45 (12): 1216–20.
- 38. Iezzoni LI. Assessing quality using administrative data. Ann Intern Med 1997; 127 (8 Pt 2): 666–74.
- 39. Federspiel CF, Ray WA, Schaffner W. Medicaid records as a valid data source: the Tennessee experience. Med Care 1976; 14 (2): 166–72.
- 40. Mehta NR, Medvidovic N, Phadke S. Towards a taxonomy of software connectors. In: Proceedings of the 22nd International Conference on Software Engineering. New York, NY: ACM; 2000: 178–87.
- 41. Lai LW. As planning is everything, it is good for something!: Coasian economic taxonomy of modes of planning. Planning Theory 2016; 15 (3): 255–73.
- 42. Ebell MH, Siwek J, Weiss BD, et al. Strength of recommendation taxonomy (SORT): a patient-centered approach to grading evidence in the medical literature. J Am Board Fam Pract 2004; 17 (1): 59–67.
- 43. Brennan A, Chick SE, Davies R. A taxonomy of model structures for economic evaluation of health technologies. Health Econ 2006; 15 (12): 1295–310.
- 44. Adler-Milstein J, Salzberg C, Franz C, Orav EJ, Bates DW. The impact of electronic health records on ambulatory costs among Medicaid beneficiaries. Medicare Medicaid Res Rev 2013; 3 (2): mmrr.003.02.a03.
- 45. Bradley EH, Curry LA, Devers KJ. Qualitative data analysis for health services research: developing taxonomy, themes, and theory. Health Serv Res 2007; 42 (4): 1758–72.
- 46. Sofaer S. Qualitative methods: what are they and why use them? Health Serv Res 1999; 34 (5 Pt 2): 1101–18.
- 47. Zhang J, Patel VL, Johnson TR, Shortliffe EH. A cognitive taxonomy of medical errors. J Biomed Inform 2004; 37 (3): 193–204.
- 48. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986; 292 (6522): 746–50.
- 49. Ousterhout JK, Jones K. Tcl and the Tk Toolkit. London, United Kingdom: Pearson Education; 2009.
- 50. Scott WS, Ousterhout JK. Magic's circuit extractor. In: Proceedings of the 22nd ACM/IEEE Design Automation Conference. Piscataway, NJ: IEEE Press; 1996: 286–92.
- 51. Owens M, Allen G. SQLite. Berlin, Germany: Springer; 2010.
- 52. Owens M. The Definitive Guide to SQLite. New York, NY: Apress; 2006.
- 53. Gschwandtner T, Gärtner J, Aigner W, Miksch S. A taxonomy of dirty time-oriented data. In: Quirchmayr G, Basl J, You I, Xu L, Weippl E, eds. CD-ARES 2012: Multidisciplinary Research and Practice for Information Systems. New York, NY: Springer; 2012: 58–72.
- 54. Oliveira P, Rodrigues F, Henriques PR. A formal definition of data quality problems. Presented at: International Conference on Information Quality (MIT IQ Conference); November 10–12, 2005; Cambridge, MA.
- 55. Lee ML, Lu H, Ling TW, Ko YT. Cleansing data for mining and warehousing. In: Bench-Capon TJM, Soda G, Tjoa AM, eds. DEXA 1999: Database and Expert Systems Applications. New York, NY: Springer; 1999: 751–60.
- 56. Barateiro J, Galhardas H. A survey of data quality tools. Datenbank-Spektrum 2005; 14 (15–21): 48.
- 57. Müller H, Freytag JC. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical report. Berlin, Germany: Humboldt-Universität zu Berlin, Institut für Informatik; 2005.
- 58. Rahm E, Do HH. Data cleaning: problems and current approaches. IEEE Data Eng Bull 2000; 23 (4): 3–13.
- 59. Kim W, Choi BJ, Hong EK, Kim SK, Lee D. A taxonomy of dirty data. Data Min Knowl Discov 2003; 7 (1): 81–99.
- 60. Li L, Peng T, Kennedy J. A rule based taxonomy of dirty data. J Comput 2018; 1 (2).
- 61. Wei W, Zhang M, Zhang B, Tang X. A data cleaning method based on association rules. In: ISKE (International Conference on Intelligent Systems and Knowledge Engineering). Paris, France: Atlantis Press; 2007: 1–5.
- 62. Naumann F. Data profiling revisited. SIGMOD Rec 2014; 42 (4): 40–9.
- 63. Demsky B, Rinard M. Automatic detection and repair of errors in data structures. In: ACM SIGPLAN Notices: Proceedings of the OOPSLA '03 Conference. Vol. 38. New York, NY: ACM; 2003: 78–95.
- 64. Hernández MA, Stolfo SJ. Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 1998; 2 (1): 9–37.