Author manuscript; available in PMC: 2025 Jul 7.
Published in final edited form as: Stud Health Technol Inform. 2025 May 15;327:517–521. doi: 10.3233/SHTI250391

A customizable data quality tool for global observational research networks

Judith LEWIS a, Brenna HOGAN b, Charles GOSS c, Dhanushi RUPASINGHE d, Fernanda MARURI a, Kalongo HAMUSONDE e, Savannah OBREGON a, Mansi AGARWAL c, Awachana JIAMSAKUL d, Megan TURNER a, Austin KATONA a, Keri ALTHOFF b, Stephany N DUDA a
PMCID: PMC12233329  NIHMSID: NIHMS2089625  PMID: 40380501

Abstract

Evaluating data quality is essential when combining multi-site observational clinical data for analysis. We collaborated with five research networks, representing various data approaches and workflows, to generalize an established data quality checking and report generation tool so it could be implemented more easily by other research consortia. The resulting approach reduced the need for technical expertise at user sites by leveraging the REDCap data collection software to store details about a research group, their data model, and expectations about variables (e.g., plausible numeric range, valid format and codes, date logic). The application then used the REDCap API to retrieve those details and assess a dataset’s conformance to the data model, logical consistency, and completeness. Users could download reports that summarized the dataset contents and quality. The generalized Harmonist Data Toolkit was built using the freely available REDCap and R/Shiny platforms, with code available on GitHub. All five collaborating consortia found the Toolkit beneficial in detecting inconsistencies and providing informative data reports and visualizations. The Harmonist Data Toolkit fills a need for data quality and report generation solutions for consortia without local programming expertise.

Keywords: Datasets as topic, observational studies, data quality, software

1. Introduction

Research networks that collect observational health data from a variety of patient populations are invaluable resources for informing public health policy. Timely and impactful analyses of the collective data are possible if the data are of high quality and can be aligned to a common data model. Researchers have developed advanced tools to support data quality checking for large EHR-extracted datasets conforming to PCORnet, OMOP, or FHIR data models [1–4], but smaller research networks with their own data models often lack the resources to implement such complex standards and tools.

The International epidemiology Databases to Evaluate AIDS (IeDEA) is an observational HIV research consortium that brings together data from almost 400 HIV care and treatment sites in 44 countries using the IeDEA Data Exchange Standard (DES) format [5,6]. To address data harmonization challenges within IeDEA, we developed the IeDEA Harmonist Data Toolkit, an interactive web-based data quality checking and report generation application written using the open-source R/Shiny framework [7]. Since its launch in 2019, use of the Toolkit within IeDEA has streamlined access to analysis-ready data by detecting missing and illogical data, tracking adherence to the IeDEA DES, and highlighting gaps in site-level data reporting. As a web-based R/Shiny application, the IeDEA Harmonist Data Toolkit does not require user programming expertise and can be hosted on a cloud-based server or locally on a personal computer. The Toolkit’s data quality checks are based on descriptors of the consortium’s common data model (metadata) that are stored in a paired REDCap database. Once a user defines the data model via the installable REDCap web forms [8], the Toolkit software can access those metadata details via the REDCap API.

Presentations and publications about the IeDEA Harmonist Data Toolkit prompted several other consortia to request a copy of the Toolkit software. In response, we generalized the design of the software to make it available to groups in need of low-code data quality solutions. We collaborated with five international infectious diseases research consortia to learn about their data quality needs, extend the Toolkit feature set, and test the software. The participating consortia included RePORT International, the Regional Prospective Observational Research in Tuberculosis consortium [9]; NA-ACCORD, the North American AIDS Cohort Collaboration on Research and Design [10]; HLB-SIMPLe, the Alliance for Heart, Lung, and Blood Co-morbiditieS Implementation Models in People Living with HIV [11]; TAHOD-CC, the TREAT Asia HIV Observational Database Continuum of Care study [12]; and HEPSANET, the Hepatitis B in Africa Collaborative Network [13]. In this paper, we share the steps involved in generalizing the IeDEA Harmonist Data Toolkit, lessons learned through collaboration, and the resulting publicly available software.

2. Methods

Although the original version of the Toolkit created for IeDEA was designed to apply data quality checks based on data model details stored in REDCap (e.g., date checks for date variables, numeric checks for numeric variables), some of the code was specific to the IeDEA consortium. The first step toward creating an easily adapted Toolkit was to generalize the application by abstracting details about the research network (everything from the network name and logo to patient age groups of interest for dataset reports) into an additional REDCap metadata template.

Toolkit developers met with the data teams from each consortium to understand their network’s common data model and harmonization challenges. We also requested sample (mock) datasets for testing. We guided the data teams from each network in entering key details specific to their networks into the newly created REDCap metadata form (e.g., data model table, variable names, code lists, variables forming the composite primary key of each table, age groups of interest, consortium logo). The REDCap forms also collected descriptors of each variable in the common data model, such as expected data format, plausible value ranges, valid codes, and date logic relationships.
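To make the idea of metadata-driven checking concrete, the following is a minimal sketch in Python. It is not the Toolkit's code (the Toolkit is written in R), and the variable names, codes, ranges, and function names (`VARIABLE_SPECS`, `check_value`, `check_row`) are hypothetical. It only illustrates how per-variable descriptors such as expected format, plausible range, and valid codes could drive generic checks.

```python
from datetime import datetime

# Hypothetical descriptors for three variables. In the Toolkit these details
# are entered in REDCap; the names and limits below are invented examples.
VARIABLE_SPECS = {
    "birth_d": {"type": "date"},
    "sex": {"type": "coded", "codes": {"1", "2", "9"}},
    "cd4_v": {"type": "numeric", "min": 0, "max": 5000},
}


def check_value(name, value, specs=VARIABLE_SPECS, date_format="%Y-%m-%d"):
    """Return an error label for one value, or None if the value passes."""
    spec = specs.get(name)
    if spec is None:
        return "variable not in data model"
    if value is None or value == "":
        return "missing"
    if spec["type"] == "date":
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            return "invalid date format"
    elif spec["type"] == "numeric":
        try:
            number = float(value)
        except ValueError:
            return "non-numeric value"
        if not spec["min"] <= number <= spec["max"]:
            return "outside plausible range"
    elif spec["type"] == "coded":
        if value not in spec["codes"]:
            return "invalid code"
    return None


def check_row(row, specs=VARIABLE_SPECS):
    """Check every variable in one record; return {variable: error label}."""
    errors = {}
    for name, value in row.items():
        error = check_value(name, value, specs)
        if error is not None:
            errors[name] = error
    return errors
```

Because the checks read only from the descriptor dictionary, supporting a different data model in this sketch means changing the descriptors, not the checking functions.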

We tested the Toolkit using the mock data provided by each consortium partner and reviewed the results with the data teams. Based on feedback from the data teams and challenges encountered during the metadata entry process, the Harmonist team revised the Toolkit code and REDCap templates to add new options that made the Toolkit more flexible to customize.

Each consortium partner’s data team received an installation packet that included instructions for installing R and R/Shiny, the Toolkit code, and REDCap API tokens that linked the Toolkit to their consortium-specific metadata. Figure 1 depicts the steps involved in setting up and using a custom Harmonist Data Toolkit. The REDCap data dictionaries in CSV format, the open-source Toolkit code, and the corresponding documentation are available on GitHub (https://github.com/IeDEA/Harmonist).
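As a rough illustration of how an API token links an application to a REDCap project, the sketch below builds and sends a record export request in Python. The Toolkit itself makes its requests from R, so this is not its code. The `token`, `content`, `format`, and `returnFormat` fields are standard REDCap API request parameters; the function names, the example token, and the choice of a record export are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request


def build_export_payload(token, content="record"):
    """Form fields for a REDCap export request returning JSON."""
    return {
        "token": token,          # project-specific API token
        "content": content,      # e.g. "record" or "metadata"
        "format": "json",
        "returnFormat": "json",  # format used for error messages
    }


def encode_payload(payload):
    """URL-encode the form fields as the body of a POST request."""
    return urllib.parse.urlencode(payload).encode("utf-8")


def fetch_redcap(api_url, token, content="record"):
    """POST the request to a REDCap API endpoint and parse the JSON reply.

    api_url is the institution's own REDCap API address (an assumption here;
    it is not given in the paper).
    """
    body = encode_payload(build_export_payload(token, content))
    request = urllib.request.Request(api_url, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the token identifies a single REDCap project, giving each consortium its own token is enough to point the same code at consortium-specific metadata.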

Figure 1.

Steps involved in the setup (A,B) and use (C,D) of a custom Harmonist Data Toolkit.

3. Results

Upon receiving the installation packet, each of the five consortia successfully installed their copy of the Harmonist Data Toolkit: four groups ran the software locally on individual computers and one group’s Toolkit was installed as a web application on a cloud server. Each data team used their Toolkit to process patient-level data to identify data quality issues and generate reports. Common errors detected included duplicate values, incorrectly formatted dates, text in numeric fields, missing data, illogical dates such as clinic visit dates after a patient’s death date, and extraneous variables and codes not defined in the common data model. Summary statistics in the reports provide an additional check for data quality issues that do not violate the data model, such as a drop-off in the number of deaths in the last year of data collection. Data managers from all consortia reported that using the Toolkit to process their data saved time in data cleaning and generated informative data reports and visualizations.
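Several of the error types listed above can be expressed as short dataset-level checks. The Python sketch below is illustrative only and is not taken from the Toolkit (which is written in R). The function names and the `patient` and `visit_d` variables are hypothetical, and dates are assumed to be ISO 8601 strings.

```python
from collections import Counter
from datetime import date


def find_duplicate_keys(rows, key_vars):
    """Return composite key values that appear in more than one row."""
    counts = Counter(tuple(row.get(var) for var in key_vars) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def find_extraneous_variables(rows, model_vars):
    """Return variables present in the data but not defined in the data model."""
    seen = set()
    for row in rows:
        seen.update(row)
    return sorted(seen - set(model_vars))


def visits_after_death(visits, death_dates):
    """Return visit rows dated after the patient's recorded death date.

    visits: list of dicts with "patient" and "visit_d" (ISO 8601 strings).
    death_dates: dict mapping patient ID to an ISO 8601 death date.
    """
    flagged = []
    for visit in visits:
        death = death_dates.get(visit["patient"])
        if death and date.fromisoformat(visit["visit_d"]) > date.fromisoformat(death):
            flagged.append(visit)
    return flagged
```

For example, calling `find_duplicate_keys(rows, ["patient", "visit_d"])` treats the two variables together as the composite primary key of a visit table.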

The unique needs of each consortium led to important modifications of the REDCap templates and Toolkit software in three key categories: (1) data model representation, which expanded how consortium data models could be mapped in the REDCap templates; (2) technical setup, which supported modular features and single-user, localhost installations instead of cloud server setups; and (3) custom reporting functionality, which recognized that research consortia had differing needs for data visualization depending on the focus of their research and the role the Toolkit played in their data workflow. Specific examples and the resulting software modifications are described in Table 1.

Table 1.

Key challenges and resolutions in generalizing the Harmonist Data Toolkit software. Each entry lists the category, the challenge, and the revision made to generalize the Data Toolkit.

Category: Data Model

Variable naming conventions
Challenge: The IeDEA consortium used the variable suffixes “_SD” and “_ED” to indicate start and end dates of events with potential duration, such as medication regimens or hospitalizations. Other common data models did not use the same variable naming convention.
Revision: We added in REDCap the option to specify the ending text strings that indicate start dates and end dates. The Toolkit automatically applies a date logic check to any pair of variables that begin with identical text and end in those strings (e.g., “med_stdt” should be before “med_enddt”).

Date formatting
Challenge: The original Toolkit code assumed that all dates would be in the ISO 8601 format (YYYY-MM-DD). However, one partner consortium collected dates in DD/MM/YYYY format while another had separate variables for the year, month, and day components of each date.
Revision: We revised the REDCap templates to allow users to specify their global date format and added an option to indicate that a specific variable was a date component (separate year, month, or day number). The Toolkit code was revised to reconstitute those dates behind the scenes for date logic checks.

Automated metadata import
Challenge: Other consortia had data models that included REDCap forms such as validated instruments and surveys.
Revision: We developed a script to convert the data dictionary of a REDCap survey into the variable list and codes list used by the Harmonist Toolkit REDCap templates.

Category: Technical Setup

Customization of data quality checks
Challenge: Some consortia did not want the full battery of data quality checks that had been implemented for the original Toolkit. Others wanted the option to program custom checks.
Revision: A setup menu was added that allowed users to choose the data quality checks of interest. We also created a sample function to allow straightforward inclusion of custom data quality checks.

Cloud vs. local installation
Challenge: Consortia wanted to avoid the cost and cybersecurity requirements of cloud hosting by running the Toolkit on a single user’s computer.
Revision: We added documentation and support materials to our GitHub repository for local and cloud server installation.

Category: Custom Reports

Patient grouping for reports
Challenge: Specifying a single patient grouping variable (e.g., site) was problematic for two consortia: one stored “site ID” across multiple variables and another uploaded single sites and did not want a grouping variable.
Revision: We developed a function to allow the grouping variable to be calculated as a combination of data model variables. In the REDCap templates that specified Toolkit functionality, we also allowed users to disable the grouping feature.

Customization of report content
Challenge: Each group can select variables for date histograms and choose patient characteristics to include in dataset summary tables, but other report content was desired.
Revision: For consortia with R programming expertise, a framework was added to create custom visualizations and tables, which are automatically included in the reproducible reports.
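The two date-related revisions in Table 1 (user-specified start and end suffixes, and dates stored in other formats or as separate components) can be sketched as follows. This Python example is a hedged illustration, not the Toolkit's R implementation. All function names are hypothetical, and the `strptime` format string stands in for whatever global date format a consortium specifies.

```python
from datetime import date, datetime


def find_date_pairs(variable_names, start_suffix, end_suffix):
    """Pair variables that share a prefix and end in the start/end suffixes."""
    if not start_suffix or not end_suffix:
        raise ValueError("suffixes must be non-empty")
    names = set(variable_names)
    pairs = []
    for name in variable_names:
        if name.endswith(start_suffix):
            partner = name[: -len(start_suffix)] + end_suffix
            if partner in names:
                pairs.append((name, partner))
    return pairs


def parse_date(text, date_format="%d/%m/%Y"):
    """Parse a date string using the consortium's declared format."""
    return datetime.strptime(text, date_format).date()


def date_from_components(year, month, day):
    """Rebuild a date from separate components; None if they are not valid."""
    try:
        return date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        return None


def start_after_end(row, pairs, date_format="%d/%m/%Y"):
    """Return the (start, end) variable pairs whose start date follows the end date."""
    flagged = []
    for start_var, end_var in pairs:
        start, end = row.get(start_var), row.get(end_var)
        if start and end:
            if parse_date(start, date_format) > parse_date(end, date_format):
                flagged.append((start_var, end_var))
    return flagged
```

With the suffixes "_stdt" and "_enddt", the variables "med_stdt" and "med_enddt" from the table's example are paired automatically, while a start variable with no matching end variable is ignored.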

4. Discussion and Conclusion

Working with five international, multi-site infectious diseases research consortia, we expanded the functionality of the Harmonist Data Toolkit to meet the data needs of different research teams. The resulting software does not require knowledge of R or programming experience and can accommodate a variety of data models. Key limitations of our software include needing access to REDCap to install the templates and having a data model that can be defined as tables, variables, and code lists. Indeed, highly abstracted data models that require substantial data pre-processing before quality checks can be run, or data models with thousands of variables, would not be a good fit for the Toolkit given computational limits of R/Shiny apps. While use of the basic Toolkit does not require programming expertise, incorporating custom data quality checks and report content necessitates knowledge of the R programming language. Future work to address Toolkit limitations includes linking to the BioPortal API [14], creating a Shiny application that generates custom data quality check code interactively, adding indicators of data recency to the Toolkit data quality metrics, allowing users to specify thresholds of acceptable levels of each type of error, and improving online Toolkit documentation.

In summary, generalizing the Toolkit software for adaptation by consortia with different data workflows has resulted in a publicly available, easily customizable data quality and report generation tool.

Acknowledgements:

This work was supported by US NIAID/NIH under R24AI124872, as well as U01AI069918 (NA-ACCORD), U01AI174268 (RePORT International), U01AI069907 (TAHOD), U24HL154426 (HLB-SIMPLe), and by the European Association for the Study of the Liver and the John C. Martin Foundation (HEPSANET). The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the governments or institutions mentioned.

References

  • [1]. Struckmann S, Mariño J, Kasbohm E, Salogni E, Schmidt CO. dataquieR 2: An updated R package for FAIR data quality assessments in observational studies and electronic health record data. J Open Source Softw 2024;9:6581. doi: 10.21105/JOSS.06581.
  • [2]. Prud’hommeaux E, Collins J, Booth D, Peterson KJ, Solbrig HR, Jiang G. Development of a FHIR RDF Data Transformation and Validation Framework and Its Evaluation. J Biomed Inform 2021;117:103755. doi: 10.1016/J.JBI.2021.103755.
  • [3]. Mohamed Y, Song X, McMahon TM, Sahil S, Zozus M, Wang Z, et al. Tailoring Rule-Based Data Quality Assessment to the Patient-Centered Outcomes Research Network (PCORnet) Common Data Model (CDM). AMIA Annual Symposium Proceedings 2023;2022:775.
  • [4]. OMOP Software Tools – OHDSI, https://www.ohdsi.org/software-tools/ (accessed January 4, 2025).
  • [5]. IeDEA, https://www.iedea.org/ (accessed January 5, 2025).
  • [6]. Duda SN, Musick BS, Davies MA, Sohn AH, Ledergerber B, Wools-Kaloustian K, et al. The IeDEA data exchange standard: A common data model for global HIV cohort collaboration. MedRxiv 2020:2020.07.22.20159921. doi: 10.1101/2020.07.22.20159921.
  • [7]. Lewis JT, Stephens J, Musick B, Brown S, Malateste K, Ha Dao Ostinelli C, et al. The IeDEA harmonist data toolkit: A data quality and data sharing solution for a global HIV research consortium. J Biomed Inform 2022;131. doi: 10.1016/J.JBI.2022.104110.
  • [8]. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009;42:377–81. doi: 10.1016/J.JBI.2008.08.010.
  • [9]. Hamilton CD, Swaminathan S, Christopher DJ, Ellner J, Gupta A, Sterling TR, et al. RePORT International: Advancing Tuberculosis Biomarker Research Through Global Collaboration. Clin Infect Dis 2015;61 Suppl 3:S155–9. doi: 10.1093/CID/CIV611.
  • [10]. Gange SJ, Kitahata MM, Saag MS, Bangsberg DR, Bosch RJ, Brooks JT, et al. Cohort Profile: The North American AIDS Cohort Collaboration on Research and Design (NA-ACCORD). Int J Epidemiol 2007;36:294. doi: 10.1093/IJE/DYL286.
  • [11]. HLB-SIMPLe Global Research Alliance, https://www.hlbsimple.org/ (accessed January 5, 2025).
  • [12]. TREAT Asia - amfAR, The Foundation for AIDS Research, https://www.amfar.org/treat-asia/ (accessed January 5, 2025).
  • [13]. Riches N, Vinikoor M, Guingane A, Johannessen A, Lemoine M, Matthews P, et al. Hepatitis B in Africa Collaborative Network: cohort profile and analysis of baseline data. Epidemiol Infect 2023;151. doi: 10.1017/S095026882300050X.
  • [14]. NCBO BioPortal, https://bioportal.bioontology.org/ (accessed January 5, 2025).
