Abstract
Background
Data quality is a complex and multifaceted concept with varying definitions depending on context. In healthcare, high-quality data is essential for clinical decision-making, patient outcomes, and research. Despite its importance, no universally accepted definition of data quality exists, and its assessment remains challenging due to the diversity of dimensions and methodologies involved. This systematic review aims to identify key dimensions of data quality in healthcare, examine methodologies used for assessment, and explore tools and software applications developed to evaluate data quality.
Methods
We searched three databases (PubMed, Web of Science, and Scopus) for articles published up to November 11, 2024, that discussed dimensions, methods, and tools developed for data quality assessment (DQA). We focused on the data quality dimensions (DQDs) evaluated in the included studies, the assessment methods applied, and the tools developed for evaluating healthcare data, and systematically categorized these aspects.
Results
A total of 44 studies were included, revealing significant variation in the number and definitions of DQDs assessed, with completeness, plausibility, and conformance being the most frequently evaluated. Diverse methodologies were employed to assess these dimensions, including rule-based systems, statistical methods, enhanced definitions, and comparisons with external gold standards. The studies also highlighted a wide range of tools and software applications used to support DQA in healthcare.
Conclusion
Understanding and applying appropriate DQDs and assessment methods are critical for ensuring that healthcare data supports valid clinical and research outcomes. This review provides a foundation for selecting suitable evaluation frameworks and tools, thereby enhancing data quality management and utilization in healthcare settings.
Supplementary information
The online version contains supplementary material available at 10.1186/s12911-025-03136-y.
Keywords: Data quality dimensions, Data quality methods, Data quality tools, Framework, Data quality assessment
Background
Quality is a multifaceted concept that can hold different meanings depending on the specific context in which it is applied [1]. Various experts have offered diverse definitions of quality [2, 3]. Juran defines quality as “fitness for use” [4], while Crosby describes it as “conformance to requirements” [5]. A more comprehensive definition is provided by ISO 9001 (Quality Management System Standard), which defines quality as the “degree to which a set of inherent characteristics (or distinguishing features) of an object”—including products, services, processes, organizations, or systems—“fulfills requirements” [6].
The concept of data quality originated in the 1950s, initially focusing on manufacturing before expanding into service sectors, including healthcare [7]. Despite its widespread use, a universally accepted definition of data quality remains elusive [8]. Various experts have offered different interpretations, often highlighting the subjective nature of quality. Data quality can be defined in several ways. Wand and Wang (1996) describe it as the ability of an information system to accurately represent any meaningful state of the real-world system in question [9]. Wang (1996) offers another perspective, defining data quality as the extent to which data possesses sufficient depth, breadth, and scope to perform designated tasks [10]. Wang’s conceptual framework classifies data quality into four dimensions: intrinsic, contextual, representational, and accessibility. Each of these dimensions plays a critical role in ensuring that healthcare data is fit for its intended purpose.
According to these definitions, high-quality data should be intrinsically accurate, contextually appropriate for its intended task, clearly represented, and easily accessible to the user [11]. Among the various definitions, the one provided by the World Health Organization (WHO) has received considerable attention. The WHO defines data quality as the ability of a system to achieve its intended objectives through lawful means and suggests that data quality reflects the alignment of data with the system’s goals and standards [7]. In healthcare, the definition of high-quality data varies among experts. According to the National Academy of Medicine (NAM), high-quality data is “data strong enough to support conclusions and interpretations equivalent to those derived from error-free data” [12].
The importance of data quality in healthcare has attracted increasing attention in recent years, particularly because of its influence on clinical decision-making and patient outcomes. Shortliffe emphasizes the importance of data quality in clinical decision-making, asserting that inaccurate data can lead to serious adverse effects and underscoring the necessity of reliable clinical data [2]. Data quality is a critical factor in clinical decision-making, healthcare service delivery, and medical research. High-quality data is crucial for informing both ongoing and future care at different levels of healthcare services, training healthcare professionals, and conducting clinical effectiveness research. Ensuring data quality not only enhances service delivery, reduces mortality rates, and minimizes medical errors, but also plays a pivotal role in data reuse, financial management, and legal compliance within healthcare systems [13]. With the growing volume of data from electronic health records (EHRs), registries, and large-scale health initiatives, there is increasing potential for secondary use in clinical effectiveness research, quality improvement, and decision support [13].
The American Medical Informatics Association (AMIA) highlights that secondary use of health data can improve patient experiences, expand disease knowledge, enhance healthcare system efficiency, and support public health initiatives [14]. Furthermore, leveraging clinical data can advance biomedical sciences, genetics, and pharmaceutical research while also replicating findings from randomized controlled trials [15–18]. However, the increasing volume of data within EHRs and registries presents substantial challenges. Poor data quality can result in misleading conclusions, leading to error-prone decision-making processes [13, 19, 20].
Inaccurate or incomplete data, particularly in clinical settings, may produce misleading results and negatively impact both patient care and research outcomes [19, 21, 22]. Additionally, variable data quality introduces excessive noise, affecting the reliability and reproducibility of findings [20, 23]. Errors in large healthcare datasets highlight the need for rigorous DQA methods to ensure suitability for research and clinical applications [24]. Therefore, assessing and improving the quality of healthcare data is essential to ensure valid, reproducible, and actionable insights.
DQA follows two main approaches. The first, global data quality measures, evaluates overall dataset quality and helps determine whether a dataset is generally suitable for use; however, such measures may lack detailed insights into fitness-for-use for specific applications. The second, fitness-for-use measures, assesses whether a dataset meets the requirements of a specific task or research objective; while highly valuable, these measures may not apply broadly to other use cases [12]. The aim of this systematic review is to identify the key dimensions used to assess data quality in healthcare, examine the methodologies employed to evaluate these dimensions, and explore the existing tools and software applications developed or utilized for this purpose. Accordingly, this study seeks to answer the following research questions:
RQ1:
What are the dimensions of data quality in healthcare?
RQ2:
Which methodologies were utilized to assess data quality?
RQ3:
What tools and software applications were utilized to assess data quality?
Overall, addressing these research questions can provide data users with a clear understanding of how to evaluate the quality of structured clinical datasets more accurately, efficiently, and effectively. By selecting appropriate DQDs, applying suitable assessment methods, and identifying existing tools and software, users will be better equipped to assess data quality in a practical and systematic manner.
Materials and methods
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [25]. The aim was to identify studies that introduced specific methods or tools for assessing data quality in the clinical domain, particularly in relation to defined DQDs.
Search strategy
A systematic literature search was conducted to identify relevant articles, with the search strategy organized into two main categories: healthcare-related data concepts, and data quality assessment terms covering DQ dimensions, methods, and tools.
These two categories included related keywords grouped using the OR operator within each block and combined using the AND operator across the blocks.
Articles were retrieved through a comprehensive search in three major electronic bibliographic databases: PubMed, Web of Science, and Scopus.
Due to the differences in search functionalities across each database, the search strategies were syntactically tailored for each one. The detailed search strategies used for PubMed, Web of Science, and Scopus are presented in Supplementary File 1.
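As a purely hypothetical illustration of this block structure (the actual, database-specific strings are given in Supplementary File 1), a query might take the following form:

```
("health care" OR "healthcare data" OR "electronic health record*" OR "clinical data")
AND
("data quality" OR "data quality dimension*" OR "data quality assessment"
 OR "data quality method*" OR "data quality tool*")
```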
Inclusion criteria
The inclusion criteria for this systematic review were as follows:
The study must specify one or more DQDs being evaluated.
The study must introduce a method and/or tool for assessing data quality.
The data under investigation had to be from the medical or healthcare domain and in a structured format.
Only studies published in the English language were included.
Exclusion criteria
The exclusion of studies was based on the following criteria:
The study did not involve clinical tabular structured data, and instead focused on non-tabular or non-static datasets.
The study solely introduced a framework or process for data collection, integration, or management, without evaluating data quality dimensions, methods, or tools.
The study focused on comparative evaluation of data quality across multiple databases or tools, without introducing specific DQ dimensions or assessment approaches.
The study discussed general challenges, issues, or solutions for data quality improvement, without empirical assessment or methodological implementation.
The study evaluated the feasibility of data reuse or comparisons before and after the implementation of systems such as EHR, EMR, or HIS, without direct focus on data quality assessment.
The article was a literature review and did not present original data quality assessment findings.
The paper was not published in English, and therefore excluded per the eligibility criteria.
Other reasons, including lack of access to the full text or insufficient methodological detail to meet the inclusion criteria.
Selection procedure and data extraction
Relevant articles were retrieved on November 11, 2024, and managed using EndNote 21. After removing duplicates, eligible studies were selected in two screening stages: titles were screened first, followed by abstracts. This process was carried out independently by two reviewers (E.H. and M.A.), and any disagreements were resolved through discussion with a third reviewer (H.T.). As a result, 44 articles were selected for full-text review. During this stage, all authors independently reviewed the studies, and key information addressing the predefined research questions was extracted and recorded in Microsoft Excel.
Evidence map construction and visualization
To enhance interpretability, an evidence map was employed in this study to illustrate the relationships among data quality dimensions, assessment methods, and tools. For this purpose, two heatmaps were generated in the R programming environment to visualize and summarize key elements from the included studies.
The first heatmap was constructed as a matrix with 4 rows and 12 columns, representing the relationship between assessment methods (rows) and data quality dimensions (e.g., completeness, plausibility, conformance) displayed in the columns. The second heatmap was structured as a 4-by-6 matrix, illustrating the alignment between data quality frameworks (rows) and the tools used for data quality assessment (columns). In both heatmaps, color intensity reflects the frequency of occurrence or higher usage of a particular dimension, method, or tool across the included studies.
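As a minimal sketch of how such a frequency matrix can be visualized (this is not the authors' script; the row and column labels are illustrative and the counts below are fabricated placeholders), base R suffices:

```r
# Illustrative 4 x 12 matrix: assessment methods (rows) by DQDs (columns).
methods <- c("Formula-based", "Enhanced definitions",
             "Rules database", "Statistical methods")
dims <- c("completeness", "plausibility", "conformance", "accuracy",
          "correctness", "consistency", "uniqueness", "concordance",
          "currency", "timeliness", "validity", "temporal relationship")
set.seed(1)
freq <- matrix(sample(0:12, length(methods) * length(dims), replace = TRUE),
               nrow = length(methods),
               dimnames = list(methods, dims))
# Colour intensity encodes how often a method was applied to a dimension.
heatmap(freq, Rowv = NA, Colv = NA, scale = "none",
        col = heat.colors(12), margins = c(10, 12))
```

Packages such as pheatmap or ggplot2 offer richer styling, but the base `heatmap` function already maps frequency to color intensity as described.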
Quality assessment of studies
Given the aim of this study, obtaining reliable findings regarding the data quality assessment process requires a clear understanding of existing DQDs, the methods used to assess and measure these dimensions, and the tools employed for data quality evaluation. Recognizing the relevant dimensions is essential for a meaningful assessment, as each dimension provides a specific perspective on the quality and usability of healthcare data. The tools developed for DQA also play a crucial role in facilitating and accelerating this process: they can support both technical and non-technical users in evaluating data quality efficiently and effectively, and by enabling broader accessibility they contribute to more consistent and standardized assessments across different healthcare settings. Therefore, in this review we assessed the quality of the included studies by assigning each a score from 0 to 1 according to predefined Quality Assessment Criteria (QAC) [26]. A score of 1 indicated that the study provided a complete response to the targeted question, 0.75 an acceptable response, 0.5 an incomplete or partially relevant answer, and 0 a failure to address the question.
Results
Search and study selection process
Fig. 1 summarizes the literature review process, detailing the number of studies included and excluded at each stage according to the PRISMA guidelines. A total of 614 studies were initially retrieved through systematic searches of three research databases: PubMed, Scopus, and Web of Science. After the screening and eligibility assessment phases, 44 studies [20, 27–69] were ultimately selected for detailed analysis.
Fig. 1.
The PRISMA diagram
Study characteristics
Fig. 2 illustrates the temporal trend of the included studies, highlighting the growing importance of DQA, particularly in the healthcare domain, in recent years.
Fig. 2.
Trend of studies over the years
The objectives of these studies can be classified into three broad categories:
The first category involves the development of tools, guidelines, or frameworks for assessing data quality. The second pertains to the evaluation of data quality for secondary use, particularly in observational studies that draw on data collected from sources such as EHRs, EMRs, healthcare information systems (HIS), cohort centers, and registries, with the goal of supporting clinical research. The third focuses on the validation of data across data transfer processes (e.g., ETL: Extract, Transform, Load) from various sources to centralized data repositories (e.g., data warehouses), which is essential for ensuring data integration. Further details regarding this classification are provided in Supplementary File 2. The United States has contributed the largest number of studies in this area; among European countries, Germany has shown the highest level of research activity related to healthcare data quality.
Figure 3 presents the distribution of studies by country, along with additional characteristics such as the data sources evaluated, data types, and the software tools used to implement DQA solutions.
Fig. 3.
Part (a) displays the distribution of data sources employed in the reviewed studies. Part (b) demonstrates the software implementations used for data quality assessment tools. Part (c) shows the geographic distribution of studies by country. Part (d) categorizes data types
DQDs
In the 44 studies reviewed, the number of dimensions considered for evaluation varied from 1 to 6. The terminology used for these dimensions, as well as the definitions provided for them, also showed considerable variation.
A detailed overview of the DQDs is depicted in Fig. 4, and their corresponding definitions are presented in Supplementary File 3.
Fig. 4.
Frequency of DQDs in studies
Out of the 44 studies, 43 introduced specific dimensions for data quality. Among these 43 studies, 40 [20, 27–39, 41–47, 49–60, 62–66, 68, 69] (93%) examined the “completeness” dimension. The “plausibility” dimension was assessed in 21 studies [29, 30, 32, 34, 36, 39, 41–44, 49, 51, 52, 54, 55, 58, 60, 64, 65, 67, 68] (49%), while 11 studies [30, 32, 34, 36, 39, 41, 43, 44, 50, 55, 62] (26%) evaluated the “conformance” dimension. Likewise, the “accuracy” dimension was addressed in 11 studies [28, 31, 33, 37, 40, 45, 47, 48, 53, 60, 69] (26%). Another frequently assessed dimension was “correctness,” which appeared in 10 studies [20, 27, 38, 46, 49, 57, 59, 63, 65, 66] (23%). Similarly, the “consistency” dimension was evaluated in 10 studies [27, 38, 45, 47, 50, 52–55, 62] (23%). Nine studies [27, 45, 50, 51, 53, 54, 58, 64, 68] (21%) investigated the “uniqueness” dimension, and seven studies [33, 49, 55, 56, 58, 64, 65] (16%) examined “concordance.” The “currency” dimension was evaluated in six studies [20, 38, 46, 49, 57, 59] (14%), and “timeliness” was assessed in four studies [33, 40, 48, 62] (9%). Three studies [45, 53, 60] (7%) focused on the “validity” dimension, and another three [50, 51, 68] (7%) evaluated the “temporal relationship” dimension. Additional dimensions were assessed less frequently: “comparability” [48, 59] and “relevance” [48, 49] were each evaluated in two studies, “amount of data” and “believability” were considered in two studies [33, 40], and two studies [51, 68] focused on the “compatibility” dimension. The “temporal stability” dimension was examined in one study [27], as were “redundancy” and “readability” in another [28]. One study [40] addressed the “objectivity” dimension, and another assessed “usability” [48]. Furthermore, separate studies evaluated the dimensions of “data standardization” and “data harmonization” [54] and Hospital Information Software Recording Ability (HISRA) [69]. It is worth mentioning that, because of overlapping terminology (e.g., “accuracy” vs. “correctness”), conceptual distinctions between terms were often blurred in the literature. However, dimension names were retained as stated in the original studies, even when overlaps existed; to preserve the terminology used by the primary authors and maintain transparency, we did not merge dimensions.
DQA methods
Among the 44 retrieved studies, 43 [20, 27–38, 40–68] introduced at least one method for assessing data quality. Of these, only two studies [28, 54] used complex AI-based DQA methods. The various methods employed for assessing each data quality dimension, along with the studies in which they were applied, are summarized in Table 1.
Table 1.
DQA methods
| Dimension | Method | Studies |
|---|---|---|
| Completeness | Based on the ratio of completed fields to the total number of fields for each variable. | 27, 29, 30, 32, 37, 41–43, 52, 55, 60 |
| | (Number of required fields − Number of missing required fields)/Number of required fields. | 35, 58 |
| | Calculation of data volume over time trends. | 65 |
| | Applying enhanced definitions. | 21, 31, 33, 34, 36, 38, 44, 46, 47, 49, 50, 56, 64 |
| | Utilizing a rules database. | 28, 54, 57, 59, 63, 66 |
| | Using statistical methods. | 45, 51, 53, 62, 68, 69 |
| Plausibility | Examining the logical coherence of one data element in relation to others. | 30, 41, 42, 58, 60, 64 |
| | Assessing the believability of values within a variable. | 42, 43, 58 |
| | Applying enhanced definitions. | 34, 36, 44, 49 |
| | Utilizing a rules database. | 32, 51, 67, 68 |
| | Using statistical methods. | 29, 32, 52, 54, 55, 65 |
| Conformance | Evaluating data compliance with predefined formats and relational constraints. | 30 |
| | Applying enhanced definitions. | 34, 36, 43, 44, 50 |
| | Utilizing a rules database. | 41, 55 |
| | Using statistical methods. | 32, 62 |
| Accuracy | Calculating the proportion of incorrect, illogical, or implausible values, including biologically unacceptable values. | 31, 47, 48, 69 |
| | Internal validation through repeated measurements. | 31 |
| | Comparison with external gold standards, calculating metrics such as accuracy, sensitivity, and specificity. | 37 |
| | Advanced validation using data pattern analysis. | 60 |
| | Applying enhanced definitions. | 33 |
| | Utilizing a rules database. | 40, 45, 53 |
| | Using statistical methods. | 28 |
| Correctness | Comparison with external gold standards and time-dependent data accuracy profiling. | 27, 66 |
| | Applying enhanced definitions. | 21, 38, 46, 49 |
| | Using statistical methods. | 57, 59, 63, 65, 66 |
| Consistency | Proportion of field values deviating from expected data types or formats. | 27, 47 |
| | Internal consistency checks through comparison of related indicators. | 28, 52 |
| | Applying enhanced definitions. | 38, 50 |
| | Utilizing a rules database. | 45, 53, 62 |
| | Using statistical methods. | 54, 55 |
| Uniqueness | Uniqueness formula: (Total records − Duplicate records)/Total records. | 27, 58 |
| | Identification of multiple activities recorded at the same time point for a single individual or source. | 64 |
| | Utilizing a rules database. | 45, 50, 51, 53, 68 |
| | Using statistical methods. | 54 |
| Concordance | Detection of violations in attribute dependencies. | 55, 64 |
| | Matching with reference values. | 58 |
| | Applying enhanced definitions. | 33, 49, 56 |
| | Using statistical methods. | 65 |
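To make the recurring “utilizing a rules database” entries in Table 1 concrete, the following is a minimal sketch in R; the rules, field names, and thresholds are hypothetical and not drawn from any included study.

```r
# Hypothetical rules database: each rule names a field and a plausibility check.
rules <- list(
  list(field = "age", check = function(x) x >= 0 & x <= 120),
  list(field = "sbp", check = function(x) x > 40 & x < 300)
)

dat <- data.frame(age = c(34, 150, 60), sbp = c(120, 80, 20))

# Proportion of implausible values flagged per rule.
violations <- sapply(rules, function(r) mean(!r$check(dat[[r$field]])))
names(violations) <- sapply(rules, `[[`, "field")
violations
```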
Other dimensions of data quality, and the methods used to evaluate them, are as follows. “Currency” was assessed against the definition that data should be entered on time and be sufficiently up to date [20, 38, 46, 49, 59], as well as via visualization techniques [57]. “Timeliness” was evaluated by applying enhanced definitions [33, 48] and utilizing a rules database [40, 62]. “Validity” was assessed using external gold standards [60] and rule-based systems [45, 53]. “Comparability” was assessed using a statistical method [59] and an improved definition [48, 59]. “Amount of data” and “believability” were evaluated through enhanced definitions (domain analysis and range checking) [33] and rule-based systems [40]. “Temporal relationship,” “compatibility,” and “objectivity” were assessed using rule-based systems [40, 50, 51, 68]. “Relevance” and “usability” were assessed using the improved-definition method [48, 49]. “Temporal stability” was evaluated using visualization techniques and qualitative user assessments [27]. “Redundancy” was assessed through statistical correlation analysis (Spearman’s correlation coefficient for variable pairs) [28]. “Readability” was evaluated using clustering algorithms (unsupervised machine learning) [28]. “Data standardization” and “data harmonization” were assessed using feature classification evaluation methods [54]. The “HISRA” dimension was calculated in study [69] using the formula: accuracy × completeness. Fig. 5 illustrates the relationship between DQDs and the methods used to assess them; as shown, existing formulas and definitions are predominantly utilized for evaluating DQDs.
Fig. 5.
Relationship between DQDs and DQA methods
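Similarly, the formula-based methods from Table 1 and the HISRA product described above can be expressed directly in code. The sketch below uses a hypothetical toy data frame; only the formulas themselves come from the included studies, and the accuracy value is a placeholder.

```r
# Toy records with hypothetical fields; NA marks a missing value and the
# repeated row a duplicate record.
df <- data.frame(id     = c(1, 2, 2, 3, 4),
                 weight = c(72, NA, NA, 55, 88),
                 sex    = c("F", "M", "M", NA, "F"))

# Completeness per variable: completed fields / total fields (Table 1).
completeness <- colMeans(!is.na(df))

# Uniqueness: (total records - duplicate records) / total records (Table 1).
uniqueness <- (nrow(df) - sum(duplicated(df))) / nrow(df)

# HISRA (study [69]): accuracy x completeness; accuracy is assumed to come
# from an external validation (placeholder value here).
accuracy <- 0.95
hisra <- accuracy * mean(completeness)

completeness; uniqueness; hisra
```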
DQA tools
Among the 44 studies retrieved for full-text review, only 24 [27–30, 32, 34, 35, 42–44, 48, 50–58, 63–66] (55%) introduced a specific tool for DQA. Of these, 7 studies [30, 32, 34, 35, 43, 58, 66] (29%) were based on the framework developed by Kahn et al. Five studies [27, 48, 54, 55, 64] (21%) implemented their tools using other frameworks, 2 studies [56, 57] (8%) were based on the guidelines introduced by Weiskopf et al., and another 2 studies [50, 51] (8%) drew upon the work of Wang et al. In 5 studies [28, 33, 44, 53, 63] (21%), other methods were used to develop the tools, and in 3 studies [42, 52, 65] (12.5%), the methodology used for tool development was not specified. Regarding implementation platforms, 7 studies [27, 29, 30, 42, 43, 64, 65] (29%) used R, presenting their tools as packages, dashboards, or toolkits. Three studies [32, 34, 54] (12.5%) used Python, delivering the tools either as software applications or toolkits. Two studies [44, 58] (8%) utilized both R and Python. The tools in 4 studies [28, 35, 52, 63] (17%) were web-based, while another 4 studies [50, 51, 53, 56, 65] (17%) developed SQL-based toolkits. Additionally, 4 studies [48, 55, 57, 66] (17%) used other statistical software for tool development.
Figure 6 is a heatmap that depicts the relationship between the types of tools developed for assessing data quality and the frameworks used in those tools. The framework introduced by Kahn and colleagues has been widely adopted by researchers, particularly in the R software environment.
Fig. 6.
Relationship between the type of data quality tool and the framework used for its development
To provide an overview of the relationships among the three components (dimensions, methods, and tools), Table 2 displays them simultaneously for the most frequent dimensions. The complete table is provided in Supplementary File 4.
Table 2.
Overview of the relationships among the three components
| Dimension | Method | Tools name | Tools type |
|---|---|---|---|
| Completeness | Based on the ratio of completed fields to the total number of fields for each variable. | Hannelore Aerts [27] | R-based toolkit [27] |
| | | DQD [30] | R-based dashboard [30] |
| | | Coutinho-Almeida [32] | Python-based toolkit [32] |
| | | DQA [43] | R package [43] |
| | | Odeny [52] | Web-based toolkit [52] |
| | | Razzaghi [55] | SAS-based toolkit [55] |
| | (Number of required fields − Number of missing required fields)/Number of required fields. | DQe-c [35] | Toolkit for web-based report [35] |
| | | Tahar [58] | R- and Python-based software [58] |
| | Calculation of data volume over time trends [65]. | Verma [65] | R-based toolkit in GEMINI [65] |
| | Applying enhanced definitions. | Yahia Mohamed [50] | SQL-based toolkit [50] |
| | | Reimer [56] | SQL-based toolkit [56] |
| | | daqapo [64] | R package [64] |
| | Utilizing a rules database. | TAQIH [28] | Web-based software [28] |
| | | Pezoulas [54] | Python-based software [54] |
| | | Wolfgang [57] | SPSS-based toolkit [57] |
| | | openCQA [63] | Web-based software [63] |
| | | QA program [66] | Toolkit [66] |
| | Using statistical methods. | PEDSnet Data Quality [45] | R- and Python-based software [45] |
| | | Whan OH [53] | SQL-based rule-based tool [53] |
| Plausibility | Examining the logical coherence of one data element in relation to others. | DQD [30] | R-based dashboard [30] |
| | | DQA [42] | R-based toolkit [42] |
| | | Tahar [58] | R- and Python-based software [58] |
| | | daqapo [64] | R package [64] |
| | Assessing the believability of values within a variable [42, 43, 58]. | DQA [42] | R-based toolkit [42] |
| | | Tahar [58] | R- and Python-based software [58] |
| | Applying enhanced definitions [34, 36, 44, 49]. | Noah Engel [34] | Python-based toolkit [34] |
| | Utilizing a rules database [32, 51, 67, 68]. | Coutinho-Almeida [32] | Python-based toolkit [32] |
| | Using statistical methods [29, 32, 52, 54, 55, 65]. | mosaicQA [29] | R package [29] |
| | | Coutinho-Almeida [32] | Python-based toolkit [32] |
| | | Odeny [52] | Web-based toolkit [52] |
| | | Pezoulas [54] | Python-based software [54] |
| | | Razzaghi [55] | SAS-based toolkit [55] |
| | | Verma [65] | R-based toolkit in GEMINI [65] |
| Conformance | Evaluating data compliance with predefined formats and relational constraints [30]. | DQD [30] | R-based dashboard [30] |
| | Applying enhanced definitions [34, 36, 43, 44, 50]. | Noah Engel [34] | Python-based toolkit [34] |
| | | DQA [43] | R package [43] |
| | | Yahia Mohamed [50] | SQL-based toolkit [50] |
| | Utilizing a rules database [41, 55]. | Razzaghi [55] | SAS-based toolkit [55] |
| | Using statistical methods [32, 62]. | Coutinho-Almeida [32] | Python-based toolkit [32] |
| Accuracy | Calculating the proportion of incorrect, illogical, or implausible values, including biologically unacceptable values. | DQAT [48] | Excel toolkit [48] |
| | Internal validation through repeated measurements [31]. | ********* | ********* |
| | Comparison with external gold standards, calculating metrics such as accuracy, sensitivity, and specificity [37]. | ********* | ********* |
| | Advanced validation using data pattern analysis [60]. | ********* | ********* |
| | Applying enhanced definitions [33]. | ********* | ********* |
| | Utilizing a rules database [40, 45, 53]. | PEDSnet Data Quality [45] | R- and Python-based software [45] |
| | | Whan OH [53] | SQL-based rule-based tool [53] |
| | Using statistical methods [28]. | TAQIH [28] | Web-based software [28] |
| Correctness | Comparison with external gold standards and time-dependent data accuracy profiling [27, 66]. | Hannelore Aerts [27] | R-based toolkit [27] |
| | | QA program [66] | Toolkit [66] |
| | Applying enhanced definitions [21, 38, 46, 49]. | ********* | ********* |
| | Using statistical methods [57, 59, 63, 65, 66]. | Wolfgang [57] | SPSS-based toolkit [57] |
| | | openCQA [63] | Web-based software [63] |
| | | Verma [65] | R-based toolkit in GEMINI [65] |
| | | QA program [66] | Toolkit [66] |
| Consistency | Proportion of field values deviating from expected data types or formats [27, 47]. | Hannelore Aerts [27] | R-based toolkit [27] |
| | Internal consistency checks through comparison of related indicators [28, 52]. | TAQIH [28] | Web-based software [28] |
| | | Odeny [52] | Web-based toolkit [52] |
| | Applying enhanced definitions [38, 50]. | Yahia Mohamed [50] | SQL-based toolkit [50] |
| | Utilizing a rules database [45, 53, 62]. | PEDSnet Data Quality [45] | R- and Python-based software [45] |
| | | Whan OH [53] | SQL-based rule-based tool [53] |
| | Using statistical methods [54, 55]. | Pezoulas [54] | Python-based software [54] |
| | | Razzaghi [55] | SAS-based toolkit [55] |
| Uniqueness | Uniqueness formula: (Total records − Duplicate records)/Total records [27, 58]. | Hannelore Aerts [27] | R-based toolkit [27] |
| | | Tahar [58] | R- and Python-based software [58] |
| | Identification of multiple activities recorded at the same time point for a single individual or source [64]. | daqapo [64] | R package [64] |
| | Utilizing a rules database [45, 50, 51, 53, 68]. | PEDSnet Data Quality [45] | R- and Python-based software [45] |
| | | Yahia Mohamed [50] | SQL-based toolkit [50] |
| | | Yahia Mohamed [51] | SQL-based toolkit [51] |
| | | Whan OH [53] | SQL-based rule-based tool [53] |
| | Using statistical methods [54]. | Pezoulas [54] | Python-based software [54] |
| Concordance | Detection of violations in attribute dependencies [55, 64]. | Razzaghi [55] | SAS-based toolkit [55] |
| | | daqapo [64] | R package [64] |
| | Matching with reference values [58]. | Tahar [58] | R- and Python-based software [58] |
| | Applying enhanced definitions [33, 49, 56]. | Reimer [56] | SQL-based toolkit [56] |
| | Using statistical methods [65]. | Verma [65] | R-based toolkit in GEMINI [65] |
Quality assessment of studies
The overall score, reflecting the general quality of each study, was calculated as the average of the scores obtained across all quality assessment questions. Forty-three studies (97%) introduced DQDs and answered the first research question completely; 43 studies (97%) answered the second research question by introducing a method to assess DQ; and the third research question was answered completely by 21 studies (48%). Following the quality assessment, the average score of the included studies was 81% for the first criterion and 52% for the second, with an overall average of 67%. This exceeds the 60% threshold [26], indicating compliance with the QAC. Details of the scoring method based on the second criterion are presented in Table 3. The findings suggest that the included studies provided sufficient information to adequately address the research questions. Individual quality assessment scores for each study are provided in Supplementary File 5.
Table 3.
The quality assessment of studies
| Data Quality Assessment | Quality Assessment Criteria | Answering Score (0, 0.5, 0.75, 1) | Score Calculation | Total Score |
|---|---|---|---|---|
| Data Quality Assessment | Dimensions, methods, and tools explained | Present the dimensions, methods, and tools: 24 | 24 × 1 | 24 studies (55%) |
| | | Present the dimensions and methods: 42 | 42 × 0.75 | 31 studies (72%) |
| | | Present the dimensions and tools: 24 | 24 × 0.75 | 18 studies (41%) |
| | | Present the methods and tools: 24 | 24 × 0.75 | 18 studies (41%) |
Studies that addressed all three components of data quality assessment (dimensions, methods, and tools) were assigned a score of 1; studies that addressed only two components, 0.75; studies that addressed only one component, 0.5; and studies that addressed none of these components, 0.
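As a sketch, this scoring rule and the threshold comparison can be reproduced as follows; the component coverage shown is hypothetical example data, not the scores of the included studies.

```r
# Scoring rule from Table 3's footnote: 3 components -> 1, 2 -> 0.75,
# 1 -> 0.5, 0 -> 0; overall quality is the mean score across studies.
score_study <- function(dimensions, methods, tools) {
  c(0, 0.5, 0.75, 1)[sum(dimensions, methods, tools) + 1]
}

# Hypothetical component coverage for three studies.
studies <- data.frame(dimensions = c(TRUE, TRUE, FALSE),
                      methods    = c(TRUE, TRUE, TRUE),
                      tools      = c(TRUE, FALSE, FALSE))
scores <- mapply(score_study, studies$dimensions, studies$methods, studies$tools)

mean(scores)          # overall average score
mean(scores) >= 0.60  # compliance with the 60% threshold [26]
```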
Discussion
Within the realm of health informatics, specialized areas such as data governance, data management, and data analysis are heavily influenced by data quality. High-quality data directly impacts these components, facilitating organized and informed medical decision-making. This interconnectedness highlights the necessity of maintaining rigorous standards in data quality, which ultimately supports effective healthcare practices [70]. Accordingly, this systematic review study aims to explore the definitions and dimensions of data quality within the healthcare field, identifying key dimensions, appropriate assessment methods and tools to evaluate the quality of healthcare data.
Our systematic review identified a diverse range of dimensions, methods, and tools for assessing data quality in healthcare. Our findings highlight the multifaceted nature of data quality assessment, with a notable emphasis on certain dimensions and their crucial role in ensuring that healthcare data is not only accurate but also fit for its intended purpose.
Among these, the completeness dimension was the most extensively evaluated. This emphasis may be attributed not only to the inherent importance of this dimension but also to the relative simplicity of its calculation. However, recent studies have approached the evaluation of completeness from a more nuanced and rigorous perspective, leading to more accurate and realistic assessments of this dimension.
For instance, the works by Wurster et al. [71] provide a comprehensive systematic review on the changes in documentation due to the introduction of electronic patient records, primarily focusing on the aspect of completeness. Their findings reinforce the need for robust methods to assess whether all necessary data elements are present and adequately recorded within EHRs. This is particularly crucial given the direct impact of missing or incomplete data on patient safety and the reliability of secondary data use.
Furthermore, the longitudinal and comparative document analyses conducted by Wurster et al. in German hospitals [72, 73] offer valuable empirical evidence on the practical implications of EHR implementation on data completeness. These studies demonstrate that while EHRs generally improve data completeness, the extent and consistency of these improvements can vary across different clinical departments and stages of adoption. Such empirical observations underscore the complexity of achieving and maintaining high data quality and suggest that the mere implementation of an EHR system does not automatically guarantee optimal data completeness. Instead, ongoing monitoring and targeted interventions are required to address specific areas of deficiency. The methodologies employed in these studies, particularly their focus on direct document analysis for assessing completeness, serve as important examples of practical data quality assessment approaches.
In the reviewed studies, the various dimensions did not have standardized definitions. At times, different definitions were used for the same dimension (polysemy), while in other cases, identical definitions were referred to by different names (synonymy). Furthermore, there was considerable overlap in the definitions of dimensions across different studies. For example, the dimensions of accuracy and correctness exhibit overlap, as do currency and timeliness, which are often defined in similar ways. In some studies, certain dimensions have also been considered as proxies for others. For instance, conformance and concordance can be regarded as proxies for plausibility, or alternatively, plausibility may be considered a proxy for consistency. In summary, although there is a general consensus among researchers regarding the existence of various DQDs, there is no collective agreement on the exact definitions or the number of these dimensions. Consequently, there is no standardized framework for defining each dimension. As a result, DQDs exhibit considerable variability, both quantitatively and qualitatively, across different studies.
The methods used to assess DQDs in the reviewed studies ranged from simple calculation formulas to advanced techniques such as machine learning algorithms. Our findings suggest that, while some included studies applied AI-enhanced techniques, such approaches were often underreported or not explicitly labeled as AI-based methods. In this systematic review, only a small percentage of the included studies (about 5%) used these methods. Several reasons may account for this; one is that more complex AI-based DQA methodologies have not yet been firmly established in the healthcare area, and researchers still rely on the methods introduced by prominent figures in this field, such as Kahn and Weiskopf, for quality assessment. Advanced AI-based methods could offer enhanced scalability and adaptability, especially in large, heterogeneous datasets. Additionally, ML-based imputation for missing data and natural language processing (NLP) for assessing the quality of unstructured clinical text could emerge as powerful tools in this domain. It is also worth highlighting the bidirectional relationship between data quality and AI in medicine: high-quality data is essential for training robust and trustworthy models, while AI can itself be leveraged to assess and improve data quality. Future research and tool development should more clearly define, classify, and evaluate these emerging strategies to bridge the gap between methodological innovation and practical implementation. However, more than one-third of the studies included in this review [30, 32, 34–36, 39–41, 43, 51, 58, 60–62, 66, 68] used the framework introduced by Kahn et al. as the foundation for their work.
The tools introduced in the reviewed studies varied in several aspects, such as their objectives and applications, and each had its own unique features. For example, a tool for tabular data quality assessment and improvement in the context of health data (TAQIH) [28] is a web-based application designed to support exploratory data analysis (EDA) processes, with a particular focus on DQA and the provision of semi-automated data quality improvement. Its primary goal is to assist non-technical healthcare professionals in assessing and improving data quality during the EDA process. In contrast, mosaicQA [29] is a library within the R software environment that enables researchers to generate reports for various types of metric and categorical data without requiring computational or programming expertise. Its purpose is to facilitate basic DQA and to produce a wide range of graphical outputs without necessitating deep experience in statistical methods or statistical software. This library enables epidemiological researchers to gain an overall understanding of their data without prior knowledge or proficiency in R.
Several other systematic reviews have explored data quality. Weiskopf et al. [74] expanded their 2013 review with a 2023 update focused on EHR DQA. Their study, based solely on the PubMed database, abstracted two additional data quality dimensions and one additional methodology relative to the 2013 study; it identified seven data quality dimensions and eight assessment methods and showed the relationships between them, but did not address existing tools for quality assessment. Ozonze et al. [75] conducted a systematic review in 2023 focusing on the operationalization of EHR DQA, mainly automated tooling, and highlighted necessary considerations for future implementations. They reviewed 23 articles published between 2011 and 2021, examining the DQA programs implemented (tools) and the DQDs those tools evaluate; in other words, their focus was on dimensions and tools, with little attention to DQA methods. In 2024, Declerck et al. [76] performed a review of reviews aiming to integrate existing frameworks related to DQDs and assessment methods in the secondary use of health data, and to consolidate the results into a unified framework; their focus was primarily on dimensions and methods, not on available tools. In contrast, the present study addresses not only the tools for DQA but also the dimensions being evaluated and the methods of evaluation employed.
Our goal in this study was to take the first steps toward designing a comprehensive conceptual framework for evaluating the quality of health data. Such a framework must first consider the dimensions of data quality; in the second step, the methods for evaluating each dimension should be presented; and finally, the existing and developed tools for implementing those methods should be introduced.
For this purpose, and in order to explore the concepts that make up this framework and the relationships among them, we evaluated the three components (dimensions, evaluation methods, and tools) simultaneously, so as to give the reader an overview of what such a framework could look like.
Therefore, the present review covered structured data using three databases and aimed to provide a more detailed classification of dimensions and assessment methods, enhancing usability and understanding for a broader audience. In addition, we aimed to provide a comprehensive overview of key criteria for DQA by illustrating, through a heatmap, the relationship between the various dimensions of data quality and the evaluation methods associated with each. By analyzing the tools presented in the included studies, we were also able to partially uncover the connection between the frameworks used for tool design and the types of tools developed. We believe the heatmap presented in Fig. 6 offers valuable insights for researchers and data analysts seeking to evaluate data quality more effectively.
Several foundational frameworks have shaped the conceptualization of data quality in healthcare, each emphasizing different dimensions and perspectives. For instance, the harmonized framework by Kahn et al. [39] focuses on system-level validation and operationalizes three core dimensions: conformance, completeness, and plausibility. In contrast, Weiskopf et al. [20] proposed a more user-centric 3 × 3 model, mapping three dimensions (completeness, correctness, and currency) across three contexts (patient, variable, and time), thereby offering a practical guide for fitness-for-use evaluations. Additionally, Wang and Strong's classic framework [11], though less frequently cited in clinical settings, introduced broader intrinsic, contextual, representational, and accessibility dimensions that inform evaluations beyond technical accuracy. Our findings show that the included studies drew variably from these conceptual models. This diversity underscores the lack of a unified vocabulary or shared evaluative framework in the field. While we do not propose a new framework, our synthesis, linking dimensions, methods, and tools, aims to provide a structural map that can inform future efforts toward framework harmonization and more transparent comparative evaluations.
In the majority of the 44 studies reviewed in this systematic review, the sequence and prioritization of DQDs in the DQA process were not specified. It appears that addressing this issue is essential for achieving a more accurate evaluation of data quality. In other words, certain DQDs should be considered as prerequisites for assessing others. This approach can help prevent redundant operations and reduce the overall time required for the evaluation process. Therefore, investigating the appropriate sequencing of DQDs in the assessment process may serve as a valuable direction for future research. Additionally, the development of a practical framework for the assessment of structured clinical data quality—as a guideline to support data processing workflows—could also be considered in subsequent studies.
Conclusions
As healthcare systems increasingly rely on digital data for clinical decision-making, research, and operational management, ensuring the quality of that data becomes critical. This systematic review highlights the multifaceted nature of data quality in healthcare and emphasizes the importance of adopting a structured, standardized approach to its assessment. We identified a wide range of dimensions used to evaluate data quality, with completeness, plausibility, and conformance emerging as the most frequently addressed. However, considerable variability in terminology, definitions, and methodological approaches was evident across the literature. Our findings reveal that while multiple tools and frameworks exist to support DQA, their implementation varies greatly, often depending on the intended use, data structure, and the expertise of end users. The lack of consistency in defining and operationalizing DQDs further complicates efforts to generalize or replicate findings across studies and settings. To address these challenges, we recommend the development of a practical framework that standardizes the definitions, assessment methods, and tool design processes for evaluating healthcare data quality. Such a framework could guide researchers in clinical contexts, helping them assess data quality and improve data analysis and decision-making. Future research should also explore the optimal sequencing and prioritization of dimensions to improve assessment efficiency.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
Not applicable.
Abbreviations
- DQA
Data Quality Assessment
- DQD
Data Quality Dimension
- ETL
Extract, Transform, Load
- HIS
Healthcare Information System
- EHR
Electronic Health Record
- HISRA
Hospital Information Software Recording Ability
- QAC
Quality Assessment Criteria
- PRISMA
Preferred Reporting Items for Systematic Reviews and Meta-Analyses
Authors’ contributions
E.H. contributed by conducting a comprehensive review, screened relevant studies and performed data extraction, creating figures and diagrams to support the article’s structure, analysis, and interpretation of data. M.A. was responsible for organizing the content, editing the original draft, screened relevant studies and performed data extraction, and designing visual elements. M.M. organized the content, performed data extraction and edited the manuscript. H.T. collaborated on designing the data extraction checklist, conducted a pilot check on data extraction for some papers, offered feedback on all manuscript versions, substantively revised the manuscript, and supervised this project. All authors read and approved the final manuscript.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
This study was assessed by the research council of Mashhad University of Medical Sciences (Reference Number: IR.MUMS.REC.1402.066). The study was approved because no identifying data have been reported.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Steffen GE. Quality medical care: A definition. JAMA. 1988;260(1):56–61. [PubMed] [Google Scholar]
- 2.M H. Information quality in health care organization. 2017.
- 3.Harvey L, Green D. Defining quality. assessment & evaluation in higher education. 1993;18(1):9–34.
- 4.De Feo JA. Juran’s quality handbook: the complete guide to performance excellence. McGraw-Hill Education New York; 2017. [Google Scholar]
- 5.Hoyer RW, Hoyer BB, Crosby PB, Deming WE. What is quality. Qual Prog. 2001;34(7):53–62. [Google Scholar]
- 6.https://www.fao.org/4/w7295e/w7295e03.htm. what is quality 2024.
- 7.Organization WH. Improving data quality: A guide for developing countries. Improving data quality: A guide for developing countries. 2003.
- 8.Elshaer I. What is the meaning of quality? 2012.
- 9.Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Commun Of The ACM. 1996;39(11):86–95. [Google Scholar]
- 10.Redman TC. Improve data quality for competitive advantage. MIT Sloan Manag Rev. 1995.
- 11.Wang RY, Strong DM. Beyond accuracy: What data quality means to data consumers. Journal Of Management Information Systems. 1996;12(4):5–33. [Google Scholar]
- 12.Richesson RL, Andrews JE, Hollis KF. Clinical research informatics. Springer; 2012. [Google Scholar]
- 13.Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PR, Bernstam EV, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 2013;51:S30–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: Data quality issues and informatics opportunities. Summit On Translat Bioinforma. 2010, 2010;1. [PMC free article] [PubMed] [Google Scholar]
- 15.McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al. The eMERGE network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Safran C, Bloomrosen M, Hammond WE, Labkoff S, Markel-Fox S, Tang PC, Detmer DE. Toward a national framework for the secondary use of health data: An American medical informatics association white paper. J Am Med Inf Assoc. 2007;14(1):1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Tannen RL, Weiner MG, Xie D. Replicated studies of two randomized trials of angiotensin-converting enzyme inhibitors: Further empiric validation of the ‘prior event rate ratio’to adjust for unmeasured confounding by indication. Pharmacoepidemiol Drug Saf. 2008;17(7):671–85. [DOI] [PubMed] [Google Scholar]
- 18.Tannen RL, Weiner MG, Xie D. Use of primary care electronic medical record database in drug efficacy research on cardiovascular outcomes: Comparison of database and randomised controlled trial findings. BMJ. 2009;338. [DOI] [PMC free article] [PubMed]
- 19.Kilkenny MF, Robinson KM. Data quality: “garbage in-garbage out”. London, England: SAGE Publications Sage UK; 2018. p. 103–05. [DOI] [PubMed] [Google Scholar]
- 20.Weiskopf NG, Bakken S, Hripcsak G, Weng C. A data quality assessment guideline for electronic health record data reuse. Egems. 2017;5(1):14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.de Lusignan S, van Weel C. The use of routinely collected computer data for research in primary care: Opportunities and challenges. Fam Pract. 2006;23(2):253–63. [DOI] [PubMed] [Google Scholar]
- 22.Finnell JT, Overhage JM, Grannis S, editors. All health care is not local: An evaluation of the distribution of emergency department care delivered in Indiana. AMIA Annual Symposium Proceedings. 2011. [PMC free article] [PubMed]
- 23.Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. Application of an ontology for characterizing data quality for a secondary use of EHR data. Appl Clin Inf. 2016;7(1):69–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Education. MoHaM. Quality of electronic health record data in health service information messages 2015. 2015.
- 25.Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. BMJ. 2009;339. [PMC free article] [PubMed]
- 26.Yang L, Zhang H, Shen H, Huang X, Zhou X, Rong G, Shao D. Quality assessment in systematic literature reviews: A software engineering perspective. Inf Softw Technol. 2021;130:106397. [Google Scholar]
- 27.Aerts H, Kalra D, Sáez C, Ramírez-Anguita JM, Mayer M-A, Garcia-Gomez JM, et al. Quality of hospital electronic health record (EHR) data based on the international consortium for health outcomes measurement (ICHOM) in heart failure: Pilot data quality assessment study. JMIR Med Inf. 2021;9(8):e27842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.R ÁS, Beristain Iraola A, Epelde Unanue G, Carlin P. TAQIH, a tool for tabular data quality assessment and improvement in the context of health data. Computer methods and programs in biomedicine. 2019;181:104824. [DOI] [PubMed]
- 29.Bialke M, Rau H, Schwaneberg T, Walk R, Bahls T, Hoffmann W. mosaicQA-a general approach to facilitate basic data quality assurance for epidemiological research. Methods Inf Med. 2017;56(S 01):e67–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Blacketer C, Defalco FJ, Ryan PB, Rijnbeek PR. Increasing trust in real-world evidence through evaluation of observational data quality. J Am Med Inf Assoc. 2021;28(10):2251–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Carsley S, Birken CS, Parkin PC, Pullenayegum E, Tu K. Completeness and accuracy of anthropometric measurements in electronic medical records for children attending primary care. 2018. [DOI] [PubMed]
- 32.Coutinho-Almeida J, Saez C, Correia R, Rodrigues PP. Development and initial validation of a data quality evaluation tool in obstetrics real-world data through HL7-FHIR interoperable Bayesian networks and expert rules. JAMIA Open. 2024;7(3):ooae062. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Diaz-Garelli J-F, Bernstam EV, Lee M, Hwang KO, Rahbar MH, Johnson TR. DataGauge: A practical process for systematically designing and implementing quality assessments of repurposed clinical data. eGEMs. 2019;7(1):32. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Engel N, Wang H, Jiang X, Lau CY, Patterson J, Acharya N, et al. EHR data quality assessment tools and issue reporting workflows for the ‘all of Us’ research program clinical data research network. AMIA Summits On Transl Sci Proc. 2022, 2022;186. [PMC free article] [PubMed] [Google Scholar]
- 35.Estiri H, Stephens KA, Klann JG, Murphy SN. Exploring completeness in clinical data research networks with DQe-c. J Am Med Inf Assoc. 2018;25(1):17–24.
- 36.Henley-Smith S, Boyle D, Gray K. Improving a secondary use health data warehouse: Proposing a multi-level data quality framework. eGEMs. 2019;7(1):38.
- 37.Huang Y, Voorham J, Haaijer-Ruskamp FM. Using primary care electronic health record data for comparative effectiveness research: Experience of data quality assessment and preprocessing in the Netherlands. J Comp Eff Res. 2016;5(4):345–54.
- 38.Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. A data quality ontology for the secondary use of EHR data. AMIA Annual Symposium Proceedings. 2015.
- 39.Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. eGEMs. 2016;4(1):1244.
- 40.Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50:S21–9.
- 41.Kamdje-Wabo G, Gradinger T, Löbe M, Lodahl R, Seuchter SA, Sax U, Ganslandt T. Towards structured data quality assessment in the German medical informatics initiative: Initial approach in the MII demonstrator study. In: MEDINFO 2019: Health and wellbeing e-networks for all. IOS Press; 2019. p. 1508–09.
- 42.Kapsner LA, Kampf MO, Seuchter SA, Kamdje-Wabo G, Gradinger T, Ganslandt T, et al. Moving towards an EHR data quality framework: the MIRACUM approach. In: German medical data sciences: Shaping change-creative solutions for innovative medicine. IOS Press; 2019. p. 247–53.
- 43.Kapsner LA, Mang JM, Mate S, Seuchter SA, Vengadeswaran A, Bathelt F, et al. Linking a consortium-wide data quality assessment tool with the MIRACUM metadata repository. Appl Clin Inf. 2021;12(4):826–35.
- 44.Khare R, Utidjian LH, Razzaghi H, Soucek V, Burrows E, Eckrich D, et al. Design and refinement of a data quality assessment workflow for a large pediatric research network. eGEMs. 2019;7(1):36.
- 45.Kim K-H, Choi W, Ko S-J, Chang D-J, Chung Y-W, Chang S-H, et al. Multi-center healthcare data quality measurement model and assessment using OMOP CDM. Appl Sci. 2021;11(19):9188.
- 46.Kiogou SD, Chi C-L, Zhang R, Ma S, Adam TJ. Clinical data cohort quality improvement: The case of the medication data in the University of Minnesota’s clinical data repository. AMIA Summits on Transl Sci Proc. 2022;2022:293.
- 47.Kookal KK, Walji MF, Brandon R, Kivanc F, Mertz E, Kottek A, et al. Systematically assessing the quality of dental electronic health record data for an investigation into oral health care disparities. J Public Health Dent. 2024;84(3):242–50.
- 48.Laberge M, Shachak A. Developing a tool to assess the quality of socio-demographic data in community health centres. Appl Clin Inf. 2013;4(1):1–11.
- 49.Lyons AM, Dimas J, Richardson SJ, Sward K. Assessing EHR data for use in clinical improvement and research. Am J Nurs. 2022;122(6):32–41.
- 50.Mohamed Y, Song X, McMahon TM, Sahil S, Zozus M, Wang Z, et al. Electronic health record data quality variability across a multistate clinical research network. J Clin Transl Sci. 2023;7(1):e130.
- 51.Mohamed Y, Song X, McMahon TM, Sahil S, Zozus M, Wang Z, Waitman LR. Tailoring rule-based data quality assessment to the Patient-Centered Outcomes Research Network (PCORnet) Common Data Model (CDM). AMIA Annual Symposium Proceedings. 2023.
- 52.Odeny BM, Njoroge A, Gloyd S, Hughes JP, Wagenaar BH, Odhiambo J, et al. Development of novel composite data quality scores to evaluate facility-level data quality in electronic data in Kenya: A nationwide retrospective cohort study. BMC Health Serv Res. 2023;23(1):1139.
- 53.Oh SW, Ko SJ, Im YS, Jung S, Choi BY, Kim JY, et al. Data quality assessment for observational medical outcomes partnership common data model of multi-center. In: Caring is sharing-exploiting the value in data for health and innovation. IOS Press; 2023. p. 322–26.
- 54.Pezoulas VC, Kourou KD, Kalatzis F, Exarchos TP, Venetsanopoulou A, Zampeli E, et al. Medical data quality assessment: On the development of an automated framework for medical data curation. Comput Biol Med. 2019;107:270–83.
- 55.Razzaghi H, Goodwin Davies A, Boss S, Bunnell HT, Chen Y, Chrischilles EA, et al. Systematic data quality assessment of electronic health record data to evaluate study-specific fitness: Report from the PRESERVE research study. PLOS Digit Health. 2024;3(6):e0000527.
- 56.Reimer AP, Milinovich A, Madigan EA. Data quality assessment framework to assess electronic medical record data for use in research. Int J Med Inf. 2016;90:40–47.
- 57.Rödle W, Prokosch H-U, Neumann E, Toni I, Haering-Zahn J, Neubert A, Eberl S. Creating a medication therapy observational research database from an electronic medical record: Challenges and data curation. Appl Clin Inf. 2024;15(1):111–18.
- 58.Tahar K, Martin T, Mou Y, Verbuecheln R, Graessner H, Krefting D. Rare diseases in hospital information systems-an interoperable methodology for distributed data quality assessments. Methods Inf Med. 2023;62(3/4):71–89.
- 59.Terry AL, Stewart M, Cejic S, Marshall JN, de Lusignan S, Chesworth BM, et al. A basic model for assessing primary health care electronic medical record data quality. BMC Med Inform Decis Mak. 2019;19:1–11.
- 60.Thuraisingam S, Chondros P, Dowsey MM, Spelman T, Garies S, Choong PF, et al. Assessing the suitability of general practice electronic health records for clinical prediction model development: A data quality assessment. BMC Med Inform Decis Mak. 2021;21:1–11.
- 61.Tian Q, Han Z, An J, Lu X, Duan H. Representing rules for clinical data quality assessment based on openEHR guideline definition language. In: MEDINFO 2019: Health and wellbeing e-networks for all. IOS Press; 2019.
- 62.Tian Q, Han Z, Yu P, An J, Lu X, Duan H. Application of openEHR archetypes to automate data quality rules for electronic health records: A case study. BMC Med Inform Decis Mak. 2021;21:1–11.
- 63.Tute E, Scheffner I, Marschollek M. A method for interoperable knowledge-based data quality assessment. BMC Med Inform Decis Mak. 2021;21:1–14.
- 64.Vanbrabant L, Martin N, Ramaekers K, Braekers K. Quality of input data in emergency department simulations: Framework and assessment techniques. Simul Model Pract Theory. 2019;91:83–101.
- 65.Verma AA, Pasricha SV, Jung HY, Kushnir V, Mak DY, Koppula R, et al. Assessing the quality of clinical and administrative data extracted from hospitals: The general medicine inpatient initiative (GEMINI) experience. J Am Med Inf Assoc. 2021;28(3):578–87.
- 66.Walker KL, Kirillova O, Gillespie SE, Hsiao D, Pishchalenko V, Pai AK, et al. Using the CER hub to ensure data quality in a multi-institution smoking cessation study. J Am Med Inf Assoc. 2014;21(6):1129–35.
- 67.Wang H, Belitskaya-Levy I, Wu F, Lee JS, Shih M-C, Tsao PS, et al. A statistical quality assessment method for longitudinal observations in electronic health record data with an application to the VA million veteran program. BMC Med Inform Decis Mak. 2021;21:1–8.
- 68.Wang Z, Talburt JR, Wu N, Dagtas S, Zozus MN. A rule-based data quality assessment system for electronic health record data. Appl Clin Inf. 2020;11(4):622–34.
- 69.Zabolinezhad H, Eslami S, Hassibian MR, Dorri S. Assessing the quality of electronic medical records in academic hospitals: A multi-center study in Iran. Front Digit Health. 2022;4:856010.
- 70.Gadd CS, Steen EB, Caro CM, Greenberg S, Williamson JJ, Fridsma DB. Domains, tasks, and knowledge for health informatics practice: Results of a practice analysis. J Am Med Inf Assoc. 2020;27(6):845–52.
- 71.Wurster F, Fütterer G, Beckmann M, Dittmer K, Jaschke J, Koeberlein-Neu J, et al. The analyzation of change in documentation due to the introduction of electronic patient records in hospitals-a systematic review. J Med Syst. 2022;46(8):54.
- 72.Wurster F, Beckmann M, Cecon-Stabel N, Dittmer K, Hansen TJ, Jaschke J, et al. The implementation of an electronic medical record in a German hospital and the change in completeness of documentation: Longitudinal document analysis. JMIR Med Inf. 2024;12:e47761.
- 73.Wurster F, Herrmann C, Beckmann M, Cecon-Stabel N, Dittmer K, Hansen T, et al. Differences in changes of data completeness after the implementation of an electronic medical record in three surgical departments of a German hospital-a longitudinal comparative document analysis. BMC Med Inform Decis Mak. 2024;24(1):258.
- 74.Lewis AE, Weiskopf N, Abrams ZB, Foraker R, Lai AM, Payne PR, Gupta A. Electronic health record data quality assessment and tools: A systematic review. J Am Med Inf Assoc. 2023;30(10):1730–40.
- 75.Ozonze O, Scott PJ, Hopgood AA. Automating electronic health record data quality assessment. J Med Syst. 2023;47(1):23.
- 76.Declerck J, Kalra D, Vander Stichele R, Coorevits P. Frameworks, dimensions, definitions of aspects, and assessment methods for the appraisal of quality of health data for secondary use: Comprehensive overview of reviews. JMIR Med Inf. 2024;12(1):e51560.
Data Availability Statement
No datasets were generated or analysed during the current study.





