TABLE 1.
Data Quality Dimension | Frameworks and Guidance | Definition |
---|---|---|
Relevance: availability, sufficiency, representativeness | Flatiron Health RWD | Availability of critical variables (exposure, outcomes, covariates) and sufficient numbers of representative patients within the appropriate time period to address a given use case |
Relevance | EMA | Extent to which a data set presents data elements useful to answer a research question. Extensiveness, including coverage: amount of information available with respect to what exists in the real world, whether it is within the capture process or not |
Relevance | NICE | Determined by whether (1) the data provide sufficient information to produce robust and relevant results and (2) results are generalizable to patients in the NHS |
Relevance | FDA | Availability of key data elements (exposure, outcomes, covariates) and sufficient numbers of representative patients for the study |
Relevance | Duke-Margolis | Assessment of whether the data adequately address the applicable regulatory question or requirement, in part or in whole. Includes whether the data capture relevant information on exposures, outcomes, and covariates, and whether the data are generalizable |
Relevance | PCORI | Contextual data quality features are described as entailing unique contextual or task-specific data quality requirements |
Reliability | Flatiron Health RWD | Degree to which the data represent the clinical concept intended, inclusive of data accuracy, completeness, provenance, and timeliness |
Reliability | EMA | The dimension that covers how closely the data reflect what they are designed to measure. It covers how correct and trustworthy the data are |
Reliability | NICE | The ability to get the same or similar result each time a study is repeated with a different population or group |
Reliability | FDA | Data accuracy, completeness, provenance, and traceability |
Reliability | Duke-Margolis | Considers whether the data adequately represent the underlying medical concepts they are intended to represent; encompasses data accrual and data quality control (data assurance) |
Reliability | PCORI | Intrinsic features of data values are described as features of quality that involve only the data values “in their own right” without reference to external requirements or tasks |
Accuracy | Flatiron Health RWD | Closeness of agreement between the measured value and the true value of what is intended to be measured |
Accuracy | EMA | Amount of discrepancy between data and reality. Precision: degree of approximation by which data represent reality |
Accuracy | NICE | How closely the data resemble reality |
Accuracy | FDA | Closeness of agreement between the measured value and the true value of what is intended to be measured. Validation: the process of establishing that a method is sound or that data are correctly measured, usually according to a reference standard |
Accuracy | Duke-Margolis | Assessment of the validity, reliability, and robustness of a data field |
Accuracy | PCORI | Not defined; concepts of plausibility, conformance, and consistency are described as alternatives |
Conformance | Flatiron Health RWD | Compliance of data values with internal relational, formatting, or computational definitions or internal or external standards |
Conformance | EMA | Assesses coherence toward a specific reference or data model |
Conformance | NICE | Whether the recording of data elements is consistent with the data source specifications |
Conformance | FDA | Data congruence with standardized types, sizes, and formats |
Conformance | Duke-Margolis | Congruence with standardized types, sizes, and formats; how compliant the data are with internal relational, formatting, or computational definitions or standards |
Conformance | PCORI | Compliance of the representation of data against internal or external formatting, relational, or computational definitions. Data values align to specified standards and formats |
Plausibility | Flatiron Health RWD | Believability or truthfulness of data values |
Plausibility | EMA | Likelihood of some information being true; a proxy to detect errors |
Plausibility | NICE | Not defined |
Plausibility | FDA | The believability or truthfulness of data values |
Plausibility | Duke-Margolis | Recorded values are logically believable given data source and expert opinion |
Plausibility | PCORI | Believability of data values (uniqueness, atemporal, temporal plausibility) |
Consistency | Flatiron Health RWD | Stability of a data value within a data set or across linked data sets or over time |
Consistency | EMA | Coherence: how different parts of overall data sets are consistent in their representation and meaning. Subdimensions include format coherence, structural coherence, semantic coherence, and uniqueness. Uniqueness: same information is not duplicated but appears in the data set once |
Consistency | NICE | Agreement in patient status in records across the data sources |
Consistency | FDA | Included as part of the definition of data integrity: completeness, consistency, and accuracy of data |
Consistency | Duke-Margolis | Stability of a data value within a data set or across linked data sets |
Consistency | PCORI | Consistency is included as a subcategory of plausibility and conformance |
Completeness | Flatiron Health RWD | Presence of data values (data value frequencies, without reference to actual values themselves) |
Completeness | EMA | Extensiveness, including completeness: amount of information available with respect to total information that could be available, given the capture process and data format |
Completeness | NICE | Percentage of records without missing data at a given time point |
Completeness | FDA | The “presence of the necessary data” |
Completeness | Duke-Margolis | Measure of recorded data present within a defined data field and/or data set. The frequencies of data attributes present in a data set without reference to data values |
Completeness | PCORI | Frequencies of data attributes present in a data set, without reference to data values |
Provenance | Flatiron Health RWD | An audit trail that accounts for the origin of a piece of data (in a database, document, or repository) together with an explanation of how and why it got to the present place |
Provenance | EMA | Not defined |
Provenance | NICE | Describes the ability to trace the origin of data and identify how it has been altered and transformed throughout its lifecycle. It provides an understanding of the trustworthiness or reliability of a data source |
Provenance | FDA | An audit trail that “accounts for the origin of a piece of data (in a database, document, or repository) together with an explanation of how and why it got to the present place.” Traceability: permits an understanding of the relationships between the analysis results (tables, listings, and figures in the study report), analysis data sets, tabulation data sets, and source data |
Provenance | Duke-Margolis | Origin of the data, sometimes including a chronologic record of data custodians and transformations. Traceability: ability to record changes to location, ownership, and values. Data accrual: the process by which data are collected and aggregated (includes provenance). Data lineage: the history of all data transformations (eg, recoding or modifying variables) |
Provenance | PCORI | Not defined |
Timeliness | Flatiron Health RWD | Data are collected and curated with acceptable recency such that the data set represents reality during the period of coverage |
Timeliness | EMA | Availability of data at the right time for regulatory decision making, that in turn entails that data are collected and made available within an acceptable time. Currency: considers freshness of the data, eg, current and immediately useful. Lateness: aspect of data being captured later than expected corresponding to reality |
Timeliness | NICE | Lag time between data collection and availability for research |
Timeliness | FDA | Not defined |
Timeliness | Duke-Margolis | Longitudinality: condition of data indexed by time/interval of exposure and outcome time |
Timeliness | PCORI | Not defined |
NOTE. Duke-Margolis definitions are synthesized from both the August 2019 and October 2018 white papers [23, 24].
Abbreviations: EMA, European Medicines Agency; FDA, US Food and Drug Administration; NHS, National Health Service; NICE, National Institute for Health and Care Excellence; PCORI, Patient-Centered Outcomes Research Institute; RWD, real-world data.
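
As a rough illustration of how three of the dimensions in Table 1 might be turned into checks, the sketch below computes completeness (presence of values without reference to the values themselves), conformance (agreement with a specified format), and temporal plausibility (believability of a date ordering) on a toy data set. None of the frameworks in the table prescribes this code. The records, field names, and the choice of ISO 8601 dates as the reference format are assumptions made purely for illustration.

```python
from datetime import date

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"patient_id": "A1", "birth_date": "1950-04-12", "diagnosis_date": "2019-06-01", "stage": "II"},
    {"patient_id": "A2", "birth_date": "1962-13-40", "diagnosis_date": "2020-02-15", "stage": None},
    {"patient_id": "A3", "birth_date": "1971-09-30", "diagnosis_date": "1965-01-01", "stage": "IV"},
    {"patient_id": "A4", "birth_date": "1948-01-05", "diagnosis_date": None, "stage": "I"},
]


def parse_iso(value):
    """Return a date if value is a valid ISO 8601 date string, else None."""
    if value is None:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def completeness(rows, field):
    """Share of records with a value present; the value itself is not inspected."""
    return sum(r[field] is not None for r in rows) / len(rows)


def conformance(rows, field):
    """Share of present values that match the assumed format (ISO 8601 date)."""
    present = [r[field] for r in rows if r[field] is not None]
    return sum(parse_iso(v) is not None for v in present) / len(present)


def temporal_plausibility(rows):
    """Among records with two valid dates, share where diagnosis is not before birth."""
    pairs = [(parse_iso(r["birth_date"]), parse_iso(r["diagnosis_date"])) for r in rows]
    valid = [(b, d) for b, d in pairs if b is not None and d is not None]
    return sum(d >= b for b, d in valid) / len(valid)


print(completeness(records, "stage"))       # 3 of 4 records have a stage
print(conformance(records, "birth_date"))   # 3 of 4 birth dates are valid ISO dates
print(temporal_plausibility(records))       # 1 of 2 checkable records is plausible
```

The three functions are kept separate because a value can pass one check and fail another: record A3 is complete and conformant, yet implausible. A real assessment would also need to handle empty denominators and use the data source's own specifications in place of the assumed format.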