Abstract
The growth of administrative data repositories worldwide has spurred the development and application of data quality frameworks to ensure that research analyses based on these data can be used to draw meaningful conclusions. However, the research literature on administrative data quality is sparse, and there is little consensus regarding which dimensions of data quality should be measured. Here we present the core dimensions of the data quality framework developed at the Manitoba Centre for Health Policy, a world leader in the use of administrative data for research purposes, and provide examples and context for the application of these dimensions to conducting data quality evaluations. In sharing this framework, our ultimate aim is to promote best practices in rigorous data quality assessment among users of administrative data for research.
Keywords: data quality, administrative data, framework
INTRODUCTION
Recent developments in information technology have spurred the growth of data repositories and sparked renewed interest in the use of administrative data for research purposes.1 Administrative data, generally described as data derived from the operation of administrative systems (eg, the health care system) for the purpose of registration, transaction, and/or record-keeping,2 are advantageous for research because they do not require time-consuming participant consent or primary data collection.3 The many research possibilities for administrative data have generated “information-rich” environments in many research institutions, built on linking records across datasets and across sectors.4
As the use of routinely collected data in research becomes more common, data quality is an increasingly important consideration. Data quality is a context-specific concept defined broadly as “fitness for use.”5 Administrative data quality can be influenced by many factors: inconsistent methods of data collection, temporary coding problems when introducing new systems, and/or systematic biases in reporting.6 Establishing a rigorous data quality framework is critical to ensuring that comprehensive and consistent evaluations form the basis of increasingly complex data analyses, and it is particularly important for secondary users of information, as they often have little control over the data collection and maintenance processes.6 However, few research institutions have adopted structured frameworks to assess data quality, and there is little consensus regarding which dimensions of data quality should be measured.7 In addition, the research literature on administrative data quality is sparse and has rarely included discussion of data quality evaluation methods.8
Therefore, in order to promote the use of rigorous data quality assessment and advance the research literature in this emerging area, we describe here the data quality framework developed and implemented at the award-winning Manitoba Centre for Health Policy (MCHP).9 In sharing this framework, our ultimate aim is to support best practices in evaluating administrative data for research purposes.
CASE DESCRIPTION
The Manitoba Population Research Data Repository at MCHP is a comprehensive population-based collection of administrative data, capturing information on virtually all residents (approximately 1.3 million individuals) of the province of Manitoba, Canada.10,11 The databases are grouped into 6 domains (health, education, social, justice, registries, and support files) and are updated on an annual or semiannual basis. The repository contains no personal identifying information, but datasets are linkable across files and over time by way of scrambled numeric identifiers. Research using linked repository datasets can describe and explain Manitoba residents’ patterns of health and health care use, social services use, and education and justice system contacts, and the findings can inform policy development and implementation in Manitoba.12 Developing a structured process to efficiently and comprehensively assess the quality of different kinds of data has been crucial to MCHP’s success as a leader in population health research.13
METHODS
To develop the MCHP data quality framework, we conducted a comprehensive search of the published and gray literature for information on data quality assessment practices in other Canadian and international research institutions. We reviewed data quality practices from the Canadian Institute for Health Information, the Public Health Agency of Canada, Statistics Canada, the Australian Bureau of Statistics, and the Institute for Clinical Evaluative Sciences (ICES) in Ontario, Canada. From these examples, we selected 5 key data quality dimensions based on their relevance to population-based research analyses and the availability of operational indicators at MCHP and incorporated them into our framework.
RESULTS
Data quality is a broad concept that is both relative and multidimensional in nature, as was evident in all of the frameworks we examined. Table 1 describes the dimensions included in data quality frameworks from Canadian institutions, which we found most helpful in developing the MCHP framework. To some degree, all of these data quality frameworks share common features. The concepts of accuracy (how well the data reflect the reality of what they were meant to measure) and timeliness (how current the data are) are included universally, and all frameworks also incorporate some measure of how “useful,” “serviceable,” or “relevant” the data are – that is, the degree to which the data meet the needs of users – although the exact definition of this dimension varies across frameworks.
Table 1.
Comparison of Canadian data quality frameworks
| Dimensions | Canadian Institute for Health Information | Public Health Agency of Canada | Statistics Canada | Institute for Clinical Evaluative Sciences | Manitoba Centre for Health Policy |
|---|---|---|---|---|---|
| Accuracy | x | x | x | x | x |
| Correctness | x | x | | | |
| Completeness | x | x | | | |
| Reliability | x | | | | |
| Reproducibility | x | | | | |
| Validity | x | x | | | |
| Measurement error | x | | | | |
| Level of bias | x | | | | |
| Consistency | x | | | | |
| Timeliness | x | x | x | x | x |
| Comparability | x | | | | |
| Accessibility | | | x | | |
| Usability/serviceability/relevance/interpretability | x | x | x | x | x |
| Coherence | | | x | | |
| Linkability | | | | x | |
To some extent, the dimensions included in each framework depend on the purpose of individual data repositories. For example, Statistics Canada assesses accessibility, or the ease with which the data can be obtained from the agency. This dimension is of much less concern for research institutions like MCHP, where accessing the data records is a highly regulated process. And while the ICES framework includes measures of anonymity and linkability, all datasets in the MCHP repository are both deidentified (personal information removed) and linkable (using scrambled numeric identifiers); therefore, including these dimensions in our data quality assessment would be redundant.
In developing MCHP’s data quality framework, we took a pragmatic approach to ensure that the quality assessments we conducted fit the scope of the available data, would be generalizable across different types of structured data, and could be conducted within the limitations of our legislative environment. Thus, the MCHP data quality framework is based on the dimensions of accuracy, internal validity, external validity, timeliness, and interpretability (Figure 1). A description of the 5 dimensions and their subcomponents follows, with examples of output presented to demonstrate their utility and research relevance.
- Accuracy: Accuracy is defined as the degree to which the data correctly describe the phenomena they were designed to measure,5 demonstrating their capability to support research conclusions.14 The concept of accuracy encompasses 5 subcomponents: completeness, correctness, measurement error, level of bias, and consistency.
- Completeness: MCHP measures completeness by the percentage of missing values in a given dataset field.9 Missing values may appear as blank fields for character variables or as periods for numeric variables. The distinct concept of “missingness” assesses trends in missing data over time (Figure 2), indicating whether the amount of missing data is changing and serving as an indicator of potential data quality problems (a minimal sketch appears after this list).
- Correctness: Correctness is measured by the percentage of invalid codes (values that do not match provided formats) and of invalid dates or out-of-range numeric values (values that fall outside the possible or established range).9 Outliers or extreme values are also flagged but not removed, as they do not always indicate poor quality; instead, flags alert users to investigate possible reasons for the occurrence. MCHP uses the Canadian Institute for Health Information’s suggested rankings of minimal, moderate, or significant to categorize completeness and correctness.15 MCHP has also developed the valid, invalid, missing, outlier (VIMO) macro to evaluate correctness, generating output that includes variable labels and corresponding percentages of valid, missing, and outlier values (Figure 3; a simplified sketch appears after this list).16 VIMO also generates the mean, minimum, maximum, median, and standard deviation for numeric variables, and the 10 most frequent values for character variables.
- Measurement error: Measurement errors occur when recorded data elements reflect incorrect answers or coding.15 Such errors can be caused by confusing definitions or weaknesses in data collection procedures. One example of measurement error is a field where either “yes” or “no” would be appropriate but that instead contains “b”; another is a patient erroneously recorded as not being hypertensive because he or she takes medication to manage blood pressure. Good documentation and automated data collection methods can help reduce measurement error.15
- Level of bias: Bias refers to systematic differences between reported values and the values that should have been reported.15 For example, sex- or age-specific biases can occur in datasets documenting certain types of chronic disease. While true bias is difficult to establish concretely other than through re-abstraction studies, possible biases can be detected when sampling errors occur or when coverage or responses are incomplete. Correlated bias can occur when one data element is correlated with another, such as length of observation time with an outcome; it is not generally assessed when data are acquired into the repository but can, if necessary, be evaluated as part of the research enterprise.
- Consistency: Consistency, also referred to as reliability, is measured by the amount of variation that would occur if repeated measurements were done.15 Consistency is often an issue for subjective data elements that may not have a correct answer, such as a rating on a scale of 1–5, and is effectively evaluated in re-abstraction studies. MCHP often assesses measurement error, level of bias, and consistency at a granular level.9 This may also require linking data across several datasets, something that is only permitted when appropriate ethical approval has been obtained. For these reasons, such quality assessment is not typically conducted during the data acquisition phase.
- Internal validity: Internal validity measures the consistency between values in 2 data fields derived from the same source.17 At MCHP, internal validity comprises 3 subcomponents: internal consistency, temporal consistency, and linkability (the ability to readily link 2 data files using a common key or identifier).
- Internal consistency: Internal consistency is a measure of the numeric agreement or logical relationship between fields.18 Examples of inconsistencies include a field noting a pregnant man or an 80-year-old woman having a baby. To measure such consistency, we use the VALIDATION macro to scan the dataset and count the number of data inconsistencies based on user-specified validation rules (a minimal sketch appears after this list).16
- Temporal consistency: Temporal consistency is the degree to which a set of time-related observations conforms to a smooth line or curve over time, together with the percentage of observations classified as outliers from that line or curve.18 However, the trend analysis must be informed by context before conclusions can be drawn: if a field such as “date of admission” in a trauma program were interpreted without accounting for historical changes in the program, the results could be misleading. MCHP’s TREND macro measures temporal consistency over a specified time (Figure 4; a simplified sketch appears after this list).16 The macro fits a series of common models and selects the one with the minimum mean squared error, estimates studentized residuals for each observation, and flags observations with significant residuals as potential outliers.9 It also flags values as potential problems if it detects repeated observations with exactly the same value (indicating no change over time).
- Linkability: MCHP defines linkability as the percentage of records having common identifiers in 2 or more administrative databases. Linkability is important for determining the data’s utility for research.18 Unique record identifiers that correspond to a personal health insurance number facilitate linkage based on deterministic matching.19 Personal health insurance numbers attached to records are scrambled before MCHP acquires the deidentified data. MCHP’s LINK macro shows the number and percentage of linkable records for a specific dataset or list of datasets (a minimal sketch appears after this list).16
- External validity: External validity refers to the relationship between the values in a data file and an external source of information, a comparison often referred to as “data confrontation.”18 For example, the level of agreement between summaries of the data and the available literature, reports, and general knowledge can be an indicator of external validity. Outside content experts might also be consulted in this assessment.
- Timeliness: Timeliness refers to how up to date the data are at the time of release.9 It reflects both the time between data request and data acquisition and the time between data acquisition and data release for research use; long delays between acquisition and release might suggest resourcing issues that need to be resolved with data providers. The currency of documentation has also been added as an indicator of timeliness.9 Because metadata inform users of important data characteristics and limitations, decisions to use data must be based on up-to-date documentation. Documentation currency is measured by the difference in time between data release and the release of associated metadata (a minimal sketch appears after this list).9
- Interpretability: Interpretability is defined as the ease with which a user can understand the data.15 This is based on the quality of documentation provided, policies and procedures, formats, and metadata. If documentation is poor, data quality issues may go unrecognized.9 MCHP does not currently have operational measures for evaluating the interpretability of data, but is developing these for future use.
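To make the completeness and missingness checks concrete, here is a minimal sketch in Python with pandas. It is not MCHP’s tooling (the macros described above are written in SAS); the variable name and data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy extract: one clinical flag observed across calendar years.
records = pd.DataFrame({
    "year":      [2005, 2005, 2009, 2010, 2010, 2011],
    "mets_bone": [np.nan, np.nan, np.nan, 1.0, np.nan, 0.0],
})

# Percent of missing values per variable, per year (cf. Figure 2).
missing_flags = records.drop(columns="year").isna()
pct_missing = missing_flags.groupby(records["year"]).mean() * 100
print(pct_missing.round(1))
```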
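The next sketch approximates the kind of valid/invalid/missing/outlier summary the VIMO macro produces. It is not the macro itself, which also reports means, medians, and the most frequent character values; the code sets and plausible ranges here are hypothetical.

```python
import pandas as pd

def vimo_profile(df, valid_codes=None, valid_ranges=None):
    """Percent valid, invalid, missing, and outlier values per variable."""
    valid_codes, valid_ranges = valid_codes or {}, valid_ranges or {}
    rows = []
    for col in df.columns:
        s = df[col]
        missing = s.isna()
        invalid = pd.Series(False, index=s.index)
        outlier = pd.Series(False, index=s.index)
        if col in valid_codes:    # character variable with a known code set
            invalid = ~missing & ~s.isin(valid_codes[col])
        if col in valid_ranges:   # numeric variable with a plausible range
            lo, hi = valid_ranges[col]
            outlier = ~missing & ((s < lo) | (s > hi))
        valid = ~(missing | invalid | outlier)
        rows.append({"variable": col,
                     "%valid": 100 * valid.mean(),
                     "%invalid": 100 * invalid.mean(),
                     "%missing": 100 * missing.mean(),
                     "%outlier": 100 * outlier.mean()})
    return pd.DataFrame(rows)

claims = pd.DataFrame({"sex": ["F", "M", "b", None, "F"],
                       "age": [34, 29, 41, 38, 250]})
print(vimo_profile(claims,
                   valid_codes={"sex": {"F", "M"}},
                   valid_ranges={"age": (0, 120)}))
```

As in the VIMO output, out-of-range values such as an age of 250 are flagged rather than removed, leaving the investigation of causes to the data user.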
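Internal consistency can be sketched as a set of rule-based checks, in the spirit of the user-specified validation rules described above. The field names and rules below are invented and are not taken from the VALIDATION macro.

```python
import pandas as pd

records = pd.DataFrame({
    "sex":      ["M", "F", "F"],
    "pregnant": [True, True, False],
    "age":      [34, 81, 29],
})

# Each rule names a logical relationship between fields that should never hold.
rules = {
    "pregnant male":        lambda d: (d["sex"] == "M") & d["pregnant"],
    "pregnancy at age 80+": lambda d: d["pregnant"] & (d["age"] >= 80),
}

for label, test in rules.items():
    print(f"{label}: {int(test(records).sum())} inconsistent record(s)")
```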
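Temporal consistency can be sketched as follows. This is a deliberately simplified stand-in for the TREND macro: it fits a single linear model rather than comparing several, and it uses standardized rather than studentized residuals; the counts are invented.

```python
import numpy as np

years  = np.arange(2000, 2013)
counts = np.array([110, 118, 127, 133, 141, 95, 158,
                   166, 171, 180, 188, 195, 204], dtype=float)

coeffs = np.polyfit(years, counts, deg=1)   # least-squares linear fit
resid  = counts - np.polyval(coeffs, years)
std_resid = resid / resid.std(ddof=2)       # 2 parameters were estimated

for year, r in zip(years, std_resid):
    if abs(r) > 2:                          # crude outlier threshold
        print(f"{year}: potential outlier (standardized residual {r:+.2f})")
```

On this toy series, only the dip in 2005 is flagged, mirroring the kind of pattern shown in Figure 4.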
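Linkability reduces to an identifier-overlap count, as sketched below with invented scrambled identifiers; the LINK macro reports the same kind of figure for repository datasets.

```python
import pandas as pd

hospital = pd.DataFrame({"scrambled_id": [101, 102, 103, 104, 105]})
registry = pd.DataFrame({"scrambled_id": [101, 103, 104, 106]})

# Share of hospital records whose scrambled identifier appears in the registry.
linkable = hospital["scrambled_id"].isin(registry["scrambled_id"])
print(f"{linkable.sum()} of {len(hospital)} records linkable "
      f"({100 * linkable.mean():.1f}%)")
```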
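Finally, the timeliness indicators amount to simple date arithmetic, as in this sketch with invented dates:

```python
from datetime import date

data_acquired     = date(2016, 4, 1)
data_released     = date(2016, 9, 15)
metadata_released = date(2017, 1, 10)

# Lag between acquiring a dataset and releasing it for research use.
print("acquisition-to-release lag:",
      (data_released - data_acquired).days, "days")
# Documentation currency: lag between data release and metadata release.
print("documentation currency lag:",
      (metadata_released - data_released).days, "days")
```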
Figure 1.
Schematic of the Manitoba Centre for Health Policy's data quality framework.
Figure 2.
Example of data missingness. This output describes trends in missing data over several calendar years. Variables are listed on the Y-axis and years on the X-axis. Each cell contains the percentage of missing data for that variable in that year. For example, data for mets_brain and mets_bone were 100% missing from 2005 to 2009; data capture then improved, so that ≤30% was missing in subsequent years.
Figure 3.
Example of VIMO macro output: data correctness.
Figure 4.
Example of TREND macro analysis: data trends over time. This trend line shows participation in a treatment program in Winnipeg, Manitoba. Services increase steadily over time, with a dip in 2005. The trend line and regression are computed from measurements across the data points.
DISCUSSION
The growth of data repositories globally has necessitated the development and application of data quality frameworks to ensure that research using administrative data is based on sound, high-quality information. This case report describes the 5 core dimensions of MCHP’s data quality framework, which may serve as an exemplar for other research institutions working with administrative data and seeking to improve their data quality assessment process. Early work on the MCHP data quality framework informed the development and adoption of a data quality assessment framework by 2 other administrative data research institutions: ICES in Ontario, Canada,8 and the Secure Anonymized Information Linkage Databank in Swansea, UK.20 The information presented here has the potential to initiate this transformative process for other institutions worldwide.
Failing to monitor data quality has multiple repercussions: increased project duration, effort, and costs; poorly informed, biased, or outdated decision-making; damaged trust in study results; and decreased end-user satisfaction. Our rigorous framework mitigates these negative consequences and provides several advantages for researchers working with administrative data. First, it provides a metric for the degree to which data quality varies over time. These comparisons can help to determine whether specific fields exhibit sudden changes in missing values or numbers of cases. Second, the framework serves to communicate issues of data quality to data users, and does so in an accessible, user-friendly way. For example, the color-coded VIMO macro output serves as a kind of “dashboard” for assessing quality indicators of a particular dataset at a glance. The ability to easily and rapidly compare data quality across different datasets is increasingly important for studies involving multiple collaborators and/or spanning numerous jurisdictions. Finally, and uniquely among the frameworks we examined, the macros and other coding tools developed as part of the MCHP framework are freely available under a General Public License,21 allowing interested users to adopt or adapt specific framework components to suit their individual needs.
Future directions
At MCHP, we are developing the capacity to use big data analytic techniques to unlock the potential of unstructured (or free-text) data, such as physicians’ notes in electronic medical records, emergency room triage notes, and imaging reports. Developing the technology to analyze these free-text data sources will add value and sophistication to traditional analytic approaches for structured administrative data, but will also present new challenges for data quality assessment.22,23 We will draw on expertise from other fields (including engineering, computer science, and mathematics) to develop appropriate techniques for assessing the data quality of unstructured health data.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sector.
Competing Interests
The authors have no competing interests to declare.
Contributors
The need for the data quality framework described in this study was conceptualized by MS, LL, MA, and LR. MA led the literature review, MS, LL, and MA designed the framework, and SH implemented it. All authors participated in interpreting the framework outputs. The manuscript was drafted by JE and JO, and all authors read and revised the content critically. The final version was approved by all authors, who agree to be accountable for the work presented.
References
- 1. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2:3.
- 2. Elias P. Administrative data. In: Dusa A, Nelle D, Stock G, Wagner G, eds. Facing the Future: European Research Infrastructures for the Humanities and Social Sciences. Berlin: SCIVERO; 2014:47–48.
- 3. Weiskopf N, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–51.
- 4. Roos LL, Gupta S, Soodeen RA, Jebamani L. Data quality in an information-rich environment: Canada as an example. Can J Aging. 2005;24(Suppl 1):153–70.
- 5. Statistics Canada. Statistics Canada's Quality Assurance Framework. http://www.statcan.gc.ca/pub/12-586-x/12-586-x2002001-eng.pdf. 2002. Accessed April 5, 2017.
- 6. Hirdes JP, Poss JW, Caldarello H, et al. An evaluation of data quality in Canada's Continuing Care Reporting System (CCRS): secondary analyses of Ontario data submitted between 1996 and 2011. BMC Med Inform Decis Mak. 2013;13:27.
- 7. Chen H, Hailey D, Wang N, Yu P. A review of data quality assessment methods for public health information systems. Int J Environ Res Public Health. 2014;11(5):5170–207.
- 8. Iron K, Manuel DG. Quality Assessment of Administrative Data (QuAAD): An Opportunity for Enhancing Ontario's Health Data. http://www.ices.on.ca/~/media/Files/Atlases-Reports/2007/Quality-assessment-of-administrative-data/Full%20report.ashx. 2007. Accessed April 5, 2017.
- 9. Azimaee M, Smith M, Lix L, Ostapyk T, Burchill C, Orr J. MCHP Data Quality Framework. Winnipeg, Manitoba: Manitoba Centre for Health Policy, University of Manitoba; 2015.
- 10. Roos LL, Nicol JP. A research registry: uses, development, and accuracy. J Clin Epidemiol. 1999;52(1):39–47.
- 11. Roos LL Jr, Nicol JP, Cageorge SM. Using administrative data for longitudinal research: comparisons with primary data collection. J Chronic Dis. 1987;40(1):41–49.
- 12. Jutte DP, Roos LL, Brownell MD. Administrative record linkage as a tool for public health research. Annu Rev Public Health. 2011;32:91–108.
- 13. Roos LL, Menec V, Currie RJ. Policy analysis in an information-rich environment. Soc Sci Med. 2004;58(11):2231–41.
- 14. Richesson RL, Horvath MM, Rusincovitch SA. Clinical research informatics and electronic health record data. Yearb Med Inform. 2014;9:215–23.
- 15. The Canadian Institute for Health Information. The CIHI Data Quality Framework. https://www.cihi.ca/en/data_quality_framework_2009_en.pdf. 2009. Accessed April 5, 2017.
- 16. Manitoba Centre for Health Policy. Data Quality Macros – Development Data Analysis Environment. http://umanitoba.ca/faculties/health_sciences/medicine/units/community_health_sciences/departmental_units/mchp/protocol/media/DQMacros_GPL3_Version2.pdf. 2013. Accessed April 5, 2017.
- 17. Cook TD, Campbell DT. Quasi-Experimentation: Design and Analysis Issues for Field Settings. 1st ed. Boston: Houghton Mifflin; 1979.
- 18. Lix LM, Smith S, Azimaee M, et al. A Systematic Investigation of Manitoba's Provincial Laboratory Data. http://mchp-appserv.cpe.umanitoba.ca/reference/cadham_report_WEB.pdf. 2012. Accessed April 5, 2017.
- 19. Roos LL, Wajda A. Record linkage strategies. Part I: estimating information and evaluating approaches. Methods Inf Med. 1991;30(2):117–23.
- 20. Jones KH, Ford DV, Jones C, et al. A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: a privacy-protecting remote access system for health-related research and evaluation. J Biomed Inform. 2014;50:196–204.
- 21. Manitoba Centre for Health Policy. Data Quality Resources. http://umanitoba.ca/faculties/health_sciences/medicine/units/chs/departmental_units/mchp/resources/repository/dataquality.html. 2016. Accessed April 5, 2017.
- 22. Carlo B, Daniele B, Federico C, Simone G. A data quality methodology for heterogeneous data. Int J Database Manage Syst. 2011;3(1):60–79.
- 23. Kiefer C. Assessing the quality of unstructured data: an initial overview. In: Proceedings of the LWDA. Potsdam, Germany: Hasso Plattner Institute, University of Potsdam; 2016.