Figure 1.
Overview of the data harmonization process used by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC). A) Existing study data in diverse formats are curated by the database of Genotypes and Phenotypes (dbGaP), including accessioning and conversion to a consistent file format. B) Formatted data and associated metadata (e.g., variable descriptions) are stored in a TOPMed DCC relational database. C) The harmonized phenotype variable is defined, and metadata for multiple studies are searched to identify candidate phenotypic variables that potentially can be harmonized together to produce the desired harmonized variable (harmonization steps 1 and 2). D) Analytical tools that interact with the DCC database are used for quality control (QC) of study variables, implementation of harmonization algorithms, and documentation; harmonized results are added to the same DCC database as that shown in step B (harmonization steps 3–5). E) Files containing a multistudy, harmonized data set and associated documentation are produced. F) Data, metadata, and documentation are submitted to a National Institutes of Health (NIH) repository for controlled access by the scientific community, while documentation files in JavaScript Object Notation format containing software code and provenance tracking are submitted to a publicly available GitHub repository.