Summary
Combining pertinent data from multiple studies can increase the robustness of epidemiological investigations. Effective “pre-statistical” data harmonization is paramount to the streamlined conduct of collective, multi-study analysis. Harmonizing data and documenting decisions about the transformations of variables to a common set of categorical values and measurement scales are time consuming and can be error prone, particularly for numerous studies with large quantities of variables. The psHarmonize R package facilitates harmonization by combining multiple datasets, applying data transformation functions, and creating long and wide harmonized datasets. The user provides transformation instructions in a “harmonization sheet” that includes dataset names, variable names, and coding instructions and centrally tracks all decisions. The package performs harmonization, generates error logs as necessary, and creates summary reports of harmonized data. psHarmonize is poised to serve as a central feature of data preparation for the joint analysis of multiple studies.
Keywords: data harmonization, data pooling, data integration, R package, data management
The psHarmonize R package assists users with pre-statistical data harmonization. Data harmonization combines data from multiple sources to create a unified dataset with a common data dictionary for downstream analysis. This process can be time consuming and error prone. psHarmonize allows users to create a set of coding instructions in a single “harmonization sheet.” The R package functionality pulls instructions from this sheet to manipulate the raw data, perform transformations as needed, and generate and summarize a harmonized dataset.
Highlights
• Pre-statistical data harmonization can be error prone and time consuming
• The psHarmonize R package performs data transformations useful in harmonization
• Coding instructions are compiled in a unified “harmonization sheet”
• After harmonization, psHarmonize generates summary R Markdown output
The bigger picture
The number of statistical analyses relying on pooled data from studies that consider more than one group, or cohort, of participants measured on more than one occasion has greatly increased in recent years. There is a need for rigorous and reproducible methods to facilitate the harmonization of pooled data. Data harmonization refers to combining data from different studies so that they can be compared and analyzed together. Harmonization of multiple datasets can require copious coding and can easily result in disjointed scripts that neither guarantee reproducibility nor leave a clear record of research team decision-making. The psHarmonize R package facilitates the central compilation of all programming instructions, as well as the efficient generation of descriptive statistics of the harmonized data.
Introduction
Data harmonization unites data from multiple sources to create a final data product that can be used for pooled statistical analyses. Harmonization steps include defining the research question, selecting studies to draw from, identifying variables of interest, finalizing units of measurement and common categorical values across all datasets through iterative decision-making, developing and executing code to perform harmonization, assessing quality, and disseminating final data products.1 This pre-statistical harmonization can be time consuming and is often error prone. The R package psHarmonize2 is designed to provide centralized functionality for synthesizing multiple datasets, recording rationale for pre-statistical harmonization decisions, accomplishing the harmonization itself, and producing reproducible descriptive summaries for data checking, all in a straightforward R environment.
Harmonization workflows and general principles3 have been previously described, with noteworthy applications and resultant work products including the Cohort and Longitudinal Studies Enhancement Resources (CLOSER) initiative to harmonize multiple biomedical and social cohort studies based in the United Kingdom4; the MINDMAP project, including six studies of healthy aging and mental well-being in cities5; ADataViewer, an interactive tool to explore harmonized variables for over 20 Alzheimer’s disease datasets6; and multi-wave survey data frequently used in social sciences research.7,8 The workflows that support these impactful projects carefully delineate strategic source data selection, harmonized variable definitions, and decision documentation but are largely built on custom computational script development, thereby creating challenges for direct computational transferability across a range of studies.
An emerging set of software provides functionality for increasingly reproducible and transferable harmonization. For example, the Data Steward Tool web application facilitates semantic integration of clinical data with a view to establishing common data models.9 The R packages retroHarmonize10 and cchsflow11 are both tailored to survey data harmonization and include functions for custom variable derivations and mappings; however, these derivations must be directly embedded into R code, which makes centralized tracking of decision-making challenging. Our approach is most similar to that of Rmonize from the Maelstrom Research group,1,12 with harmonization input compiled outside of R in human-readable Excel files to support centralized documentation of harmonization decisions along with custom code-based transformations. Particularly useful features of psHarmonize include R Markdown summary reports that display cross-tabulations of source and harmonized data for quality assurance, built-in organization of variables according to domains if relevant (e.g., clinical, lifestyle, etc.), and the return of harmonized data in both wide and long formats.
The Dementia Risk Prediction Project (DRPP) is a consortium of sixteen United States and European cohorts focused on developing risk prediction models for Alzheimer’s disease and related dementias (ADRD) using risk factor data from multiple domains. These domains include demographics (age, gender, race, ethnicity, educational attainment), clinical risk factors (blood pressure, cholesterol, glucose, Hemoglobin A1C [HbA1C], medication use), genetic risk factors (Apolipoprotein E [APOE]), behavioral risk factors (diet, smoking, alcohol use, physical activity), and clinical outcomes (dementia, stroke, and cardiovascular events). As a necessary step to enable robust risk prediction model development in the DRPP using data from all contributing cohorts, psHarmonize was developed by the DRPP data team for pre-statistical harmonization of 43 baseline and longitudinal variables for over 100,000 individuals.
Results
psHarmonize was initially developed to support pre-statistical harmonization of DRPP data. The harmonization workflow and specific use of psHarmonize functionality is illustrated in Figure 1. Harmonization steps and results using psHarmonize will be illustrated using DRPP data, but the methods are generalizable to any collection of multiple studies requiring pre-statistical harmonization prior to joint analysis.
Figure 1.
Harmonization workflow diagram
Pre-statistical harmonization begins with cataloging data from the datasets to be harmonized. Decisions on how to harmonize categorical and continuous data are made in collaboration with the research team. The harmonization sheet can be used to record all harmonization instructions; the harmonization sheet then serves as primary input into the harmonization() function in the psHarmonize R package. Long and wide harmonized datasets are created, and summary reports are generated for viewing results.
DRPP cohorts
The DRPP team harmonized sixteen longitudinal cohorts to create common variables within demographic, clinical, genetic, and behavioral risk factors, as well as clinical outcome domains. Cognitive testing data for clinical dementia outcomes were harmonized separately using item response theory methodology and are not discussed here. The sixteen DRPP cohorts include the Age, Gene/Environment Susceptibility-Reykjavik Study (AGES),13 the Framingham Heart Study (FHS) Original Cohort,14,15 the FHS Offspring Cohort,16,17 the FHS New Offspring Cohort, FHS Third Generation,18 FHS Omni 1 and 2, the Atherosclerosis Risk in Communities Study (ARIC),19 the Cardiovascular Health Study (CHS),20 the Kuakini Honolulu-Asia Aging Study (HAAS),21 the Multi-Ethnic Study of Atherosclerosis (MESA),22 Reasons for Geographic and Racial Differences in Stroke (REGARDS),23 the Sacramento Area Latino Study on Aging (SALSA),24 Whitehall II,25 The Rotterdam Study,26 and the Three-City Study.27 The Rotterdam Study and the Three-City Study did not contribute raw data and serve specifically as external validation cohorts. DRPP-harmonized variables and their domains are listed in Table 1.
Table 1.
Harmonized variables and domains in DRPP
| Domain | Variables | Units/categories | Label |
|---|---|---|---|
| Demographic | age | years | age of participant at that visit |
| Demographic | education | no school/grade school; high school; technical/vocational/college/graduate/professional | educational categories |
| Demographic | gender | female; male | gender |
| Demographic | race_ethnicity | White; Black; Hispanic; Asian; other | race/ethnicity of participant |
| Clinical risk factors | A1C | % | hemoglobin A1C |
| Clinical risk factors | BMI | kg/m2 | body mass index |
| Clinical risk factors | casual_plasma_glucose | mg/dL | plasma glucose taken when patient has not fasted |
| Clinical risk factors | diastolic_blood_pressure | mmHg | average of diastolic blood pressure; used right arm when the choice was presented |
| Clinical risk factors | fasting_plasma_glucose | mg/dL | fasting plasma glucose |
| Clinical risk factors | HDL_cholesterol | mg/dL | high-density lipoprotein cholesterol |
| Clinical risk factors | height | cm | height (cm) |
| Clinical risk factors | LDL_cholesterol | mg/dL | low-density lipoprotein cholesterol |
| Clinical risk factors | systolic_blood_pressure | mmHg | average of systolic blood pressure; used right arm when the choice was presented |
| Clinical risk factors | total_cholesterol | mg/dL | total cholesterol |
| Clinical risk factors | use_of_anti_diabetic_medication | yes; no | use of anti-diabetic medication at current visit |
| Clinical risk factors | use_of_anti_hypertensive_medication | yes; no | use of anti-hypertensive medication at current visit |
| Clinical risk factors | use_of_lipid_lowering_medication | yes; no | use of lipid-lowering medication at current visit |
| Clinical risk factors | weight | kg | weight (kg) |
| Lifestyle | pa_met_per_week | metabolic equivalent of task (MET) (min/week) | moderate and vigorous physical activity |
| Outcomes | cesd_score | – | Center for Epidemiological Studies-Depression (CESD) |
| Outcomes | chd | yes; no | coronary heart disease (CHD); includes heart failure and/or heart attack |
| Outcomes | chd_followup | days | days to CHD event, death, or end of follow-up |
| Outcomes | cvd | yes; no | cardiovascular disease (CVD); includes CHD and/or stroke |
| Outcomes | cvd_followup | days | days to CVD event, death, or end of follow-up |
| Outcomes | death | yes; no | death indicator |
| Outcomes | death_followup | days | days to death or end of follow-up |
| Outcomes | dementia | yes; no | dementia indicator |
| Outcomes | dementia_followup | days | intermediate variables used to calculate dementia follow-up |
| Outcomes | depression_assess | yes; no | indicator of depression; based on CESD or Geriatric Depression Scale (GDS) |
| Outcomes | gds_score | – | Geriatric Depression Scale |
| Outcomes | stroke | yes; no | stroke indicator |
| Outcomes | stroke_followup | days | days to stroke event, death, or end of follow-up |
| Chronic diseases | diabetes | normal; pre-diabetes; diabetes | diabetes definition based on fasting plasma glucose, random casual glucose, A1C, and use of anti-diabetic medication |
| Chronic diseases | hypercholesterolemia | normal; elevated; hypercholesterolemia | hypercholesterolemia definition based on use of lipid-lowering medication and total cholesterol |
| Chronic diseases | hypertension | normal BP; elevated BP; hypertension stage 1; hypertension stage 2 | hypertension definition based on use of anti-hypertensive medication and blood pressure |
| Genetic | APOE | 34/44; 33; 22/23; 24 | APOE allele carrier |
| Dates | denom | yes | indicator of participant being in denominator for that visit |
| Dates | cal_time | days | days from 1960-01-01 to current visit |
| Dates | days_to_visit | days | days from initial visit to current visit |
| Cognitive tests | mmse | – | cognitive tests; mini-mental state examination |
| Behavioral | AHA_Score | 0; 1; 2 | American Heart Association diet score; 0 = poor; 1 = intermediate; 2 = ideal |
| Behavioral | alcohol | g/week | alcohol consumption in grams/week |
| Behavioral | smoking | non-smoker; former smoker; current smoker | smoking status |
The psHarmonize package was used by the research team to harmonize data from multiple cohorts. This table lists the 43 variables that were harmonized.
Meta-data
Creation of meta-data datasets to centrally inventory all data resources is a critical first step at the start of any harmonization workflow. See Figure 2 for a screenshot of the meta-data dataset for a small number of variables contributed by the SALSA to the DRPP. The meta-data dataset has a row for every variable in each raw dataset contributed to the DRPP and columns for dataset name, variable name, variable label (if provided), cohort, and visit number. While psHarmonize functionality is not explicitly dependent on the meta-data dataset, this initial inventory of all raw data was a necessary component of the workflow for the efficient identification of variables in the DRPP’s prioritized domains across datasets and cohorts and would be critical for any harmonization effort.
Figure 2.
Meta-data for a small number of variables contributed to the DRPP by the SALSA
This is a small example of meta-data cataloging that should be completed for each dataset prior to harmonization. This will facilitate the organized retrieval of variables and their original data sources.
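The inventory described above can be sketched in base R. This is a minimal illustration rather than psHarmonize functionality: the datasets, the variable names, the `inventory_dataset()` helper, and the use of a "label" attribute to hold variable labels are all assumptions made for the example.

```r
# Hypothetical raw datasets standing in for files contributed by one cohort.
salsa_visit1 <- data.frame(id = 1:3, htcm = c(160, 172, 155), educyrs = c(6, 12, 16))
salsa_visit2 <- data.frame(id = 1:3, wtkg = c(61, 80, 57))

# Assumption: variable labels, when provided, are stored in a "label" attribute.
attr(salsa_visit1$htcm, "label") <- "Standing height (cm)"

# Hypothetical helper: build one inventory row per variable in a raw dataset.
inventory_dataset <- function(data, dataset_name, cohort, visit) {
  labels <- vapply(
    names(data),
    function(v) {
      lab <- attr(data[[v]], "label")
      if (is.null(lab)) NA_character_ else as.character(lab)
    },
    character(1)
  )
  data.frame(
    dataset = dataset_name,
    variable = names(data),
    label = unname(labels),
    cohort = cohort,
    visit = visit,
    stringsAsFactors = FALSE
  )
}

# Stack the per-dataset inventories into a single meta-data data frame.
meta_data <- rbind(
  inventory_dataset(salsa_visit1, "salsa_visit1", "SALSA", 1),
  inventory_dataset(salsa_visit2, "salsa_visit2", "SALSA", 2)
)

# Search the inventory for candidate height variables across datasets.
height_candidates <- meta_data[grepl("height", meta_data$label, ignore.case = TRUE), ]
```

Keeping the inventory as a plain data frame means the same search can be repeated for each prioritized domain as new datasets arrive.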
Harmonization decision-making and execution using the “harmonization()” function
To use psHarmonize, all transformation information as well as harmonized variable names are compiled into a “harmonization sheet.” The harmonization sheet may be created in .xlsx format or as a data frame in R. The harmonization sheet not only serves as a set of instructions and code to be run with the harmonization() function but also allows for centralized documentation of coding decisions as well as general “notes” (Figure 3). These coding notes and general notes are included in the harmonization sheet itself and are also displayed in the summary report along with summary statistics. This can help explain and track decisions that are made during harmonization. Final units of variables can also be tracked within the harmonization sheet. These are also displayed in the summary reports where appropriate.
Figure 3.
Example harmonization sheet
Examples of lines from the harmonization sheet for height (A) and education (B and C) across multiple cohorts are presented here. Height was a continuous variable and was harmonized to centimeters as the final unit for all cohorts. When necessary, a "code_type" of “function” was used to convert measurements in inches to centimeters as illustrated in (A) for FHS – Cohort. (B) and (C) illustrate harmonization of the education variable. Since this is a categorical variable for many studies, a “code_type” of “recode category” was used to map original source variables to harmonized variables for the DRPP. More complex variable transformations using multiple source variables as input were combined using a “function,” as illustrated in (C) for the SALSA.
Continuous variables are often measured in different units across cohorts; for example, height measurements in the DRPP were made in either inches or centimeters, depending on the cohort. For height, the DRPP decided on centimeters as the common unit. The harmonization sheet allowed the DRPP data team to record which cohorts, and depending on the variable, which visits within each cohort, required conversion to ensure common measurement units in the harmonized dataset. Specifically for height, the conversion code of “x * 2.54” was used in “code1” to convert inches to centimeters, with an entry of “function” in “code_type” to indicate that the entry in “code1” should be interpreted as an R function (Figure 3A). For height variables that were reported in centimeters in the original data, the “code1” and “code_type” columns were left blank to indicate that no transformation was needed. In this example, the “x” in the “x * 2.54” “code1” instruction refers to the vector of the non-harmonized source variable that is specified in the “source_item” column of the harmonization sheet. R then defines and calls the function, with the data from the variable named in “source_item” supplied as “x.” Multiple source variables can be used as well. In that case, the variable names in “source_item” are separated by semicolons, and each “x” takes a numeric suffix. For example, a BMI variable could be calculated with a “source_item” value of “weight; height” and a “code1” value of “x1/x2**2” (provided weight was in kilograms and height in meters). The “code1” syntax and function are described in more detail in the experimental procedures section.
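The mechanics of the “x” and “x1”, “x2” placeholders can be illustrated with a short sketch. The `apply_code1()` helper below is hypothetical and is not the package's internal implementation: it evaluates the “code1” expression directly instead of defining a function, and the dataset and column names are invented for the example.

```r
# Hypothetical helper: evaluate a "code1" expression against the source
# variable(s) named in "source_item". One source item is exposed as x;
# several items separated by semicolons are exposed as x1, x2, ...
apply_code1 <- function(code1, data, source_item) {
  items <- trimws(strsplit(source_item, ";", fixed = TRUE)[[1]])
  arg_names <- if (length(items) == 1) "x" else paste0("x", seq_along(items))
  inputs <- setNames(lapply(items, function(v) data[[v]]), arg_names)
  eval(parse(text = code1), envir = inputs, enclos = baseenv())
}

# Invented example data: height in inches, weight in kilograms.
fhs <- data.frame(id = 1:3, height_in = c(60, 65, 70), weight_kg = c(55, 70, 85))

# Single source variable: inches to centimeters.
height_cm <- apply_code1("x * 2.54", fhs, "height_in")

# Two source variables: BMI from weight (kg) and height (m).
fhs$height_m <- height_cm / 100
bmi <- apply_code1("x1/x2**2", fhs, "weight_kg; height_m")
```

R parses `**` as the exponent operator `^`, so “x1/x2**2” divides weight by squared height.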
As another example, all but two cohorts had cholesterol variables (total, high-density lipoprotein, and low-density lipoprotein) measured in mg/dL. In order to convert cholesterol values from mmol/L to mg/dL in the AGES and Whitehall II, the code “x ∗ 38.67” was inserted for “code1” along with a code type of “function” for these two cohorts in the harmonization sheet. These fields were left blank for all other cohorts.
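As a rough illustration, the corresponding rows of a harmonization sheet could be written as an R data frame. The column names are those mentioned in the text; the dataset names, source variable names, and notes are hypothetical, and a real sheet may require additional columns, so the package documentation should be consulted before use.

```r
# Sketch of harmonization sheet rows for unit conversion. Dataset and
# source variable names are invented for illustration.
harmonization_sheet <- data.frame(
  study          = c("AGES", "Whitehall II", "ARIC", "FHS Original"),
  visit          = c(1, 1, 1, 1),
  item           = c("total_cholesterol", "total_cholesterol",
                     "total_cholesterol", "height"),
  source_dataset = c("ages_v1", "whitehall_v1", "aric_v1", "fhs_orig_v1"),
  source_item    = c("chol_mmol", "tchol", "tchol_mgdl", "ht_in"),
  # Blank code_type and code1 indicate that no transformation is needed.
  code_type      = c("function", "function", "", "function"),
  code1          = c("x * 38.67", "x * 38.67", "", "x * 2.54"),
  coding_notes   = c("mmol/L to mg/dL", "mmol/L to mg/dL",
                     "already in mg/dL", "inches to cm"),
  stringsAsFactors = FALSE
)

# Rows that carry a transformation, for a quick visual review.
needs_conversion <- harmonization_sheet[harmonization_sheet$code_type == "function", ]
```

Because the conversion factor sits next to the cohort and variable it applies to, reviewers can audit unit decisions without reading any script.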
Categorical variables also often have disparate labels across studies. For example, the DRPP investigator team designated harmonized categories for education as “no school/grade school,” “high school,” and “technical/vocational/college/graduate/professional” (Figure 3B). This decision was made after multiple conversations among the DRPP team and the individual cohorts to understand the nature of the education data that were collected for each. Most cohorts required some type of recoding to the DRPP-harmonized education categories. For example, the ARIC had education categories of “1 = grade school or 0 years education,” “2 = high school, but no degree,” “3 = high school graduate,” “4 = vocational school,” “5 = college,” and “6 = graduate school or professional school.” The ARIC’s lowest educational category was mapped to DRPP’s lowest category (“1 = grade school or 0 years education” to “no school/grade school”). ARIC’s two high school education categories were mapped to DRPP’s high school educational category (“2 = high school, but no degree” and “3 = high school graduate” to “high school”). The ARIC’s educational levels of “4 = vocational school” and higher were mapped to DRPP’s category that included vocational schools and college (“4 = vocational school,” “5 = college,” and “6 = graduate school or professional school” to “technical/vocational/college/graduate/professional”). This recoding was done using a “code_type” of “recode category” in the harmonization sheet (Figure 3B).
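A many-to-one recode of this kind can be sketched as a lookup built from a “value = label” string. The `recode_category()` helper below is hypothetical and only approximates what a “recode category” instruction does; the exact “code1” syntax accepted by the package for many-to-one mappings is an assumption here.

```r
# Hypothetical helper: parse "value = label; value = label" pairs and
# apply them to a vector of source codes. Unmapped codes become NA.
recode_category <- function(x, code1) {
  pairs <- strsplit(strsplit(code1, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  from <- trimws(vapply(pairs, `[`, character(1), 1))
  to <- trimws(vapply(pairs, `[`, character(1), 2))
  unname(setNames(to, from)[as.character(x)])
}

# ARIC-style mapping of six source codes to three harmonized categories.
aric_code1 <- paste(
  "1 = no school/grade school",
  "2 = high school",
  "3 = high school",
  "4 = technical/vocational/college/graduate/professional",
  "5 = technical/vocational/college/graduate/professional",
  "6 = technical/vocational/college/graduate/professional",
  sep = "; "
)

education <- recode_category(c(1, 3, 6, NA), aric_code1)
```

Missing source values stay missing, which keeps the recode from silently assigning a category.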
Other cohorts had a continuous education years variable that was categorized for harmonization in the DRPP. The SALSA has an indicator variable for any education (0 for no and 1 for yes) and a second variable to indicate the years of education. For DRPP harmonization, no education (0 for education status) was harmonized to “no school/grade school”; an education status of 1 and education years equal to or greater than 0 and less than 9 to “no school/grade school”; an education status of 1 and education years equal to or greater than 9 and less than 13 to “high school”; and an education status of 1 and education years equal to or greater than 13 to “technical/vocational/college/graduate/professional.” In order to complete this harmonization, a “code_type” of “function” that acted on two source variables was used (Figure 3C).
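The two-variable rule described for the SALSA can be written as a function of “x1” (education status) and “x2” (education years). This is a sketch of the logic only; the function name and example values are hypothetical, and the actual “code1” entry used by the DRPP is shown in Figure 3C rather than reproduced here.

```r
# Sketch of the SALSA education rule: x1 is the any-education indicator
# (0 = no, 1 = yes) and x2 is years of education.
harmonize_salsa_education <- function(x1, x2) {
  ifelse(
    x1 == 0, "no school/grade school",
    ifelse(
      x2 >= 0 & x2 < 9, "no school/grade school",
      ifelse(
        x2 >= 9 & x2 < 13, "high school",
        ifelse(
          x2 >= 13, "technical/vocational/college/graduate/professional",
          NA_character_
        )
      )
    )
  )
}

# Invented example values covering each branch and a missing indicator.
any_educ <- c(0, 1, 1, 1, 1, NA)
educ_years <- c(NA, 4, 9, 12, 13, 10)
salsa_education <- harmonize_salsa_education(any_educ, educ_years)
```

A participant with a missing education indicator is left as missing rather than being classified from years alone, which is one possible reading of the rule.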
Harmonized data visualization
psHarmonize can be used to generate R Markdown reports to compare the source categories and frequencies within each cohort with their harmonized values. Continuous variables can also be plotted on their original and harmonized scales within cohorts. In the DRPP, these reports were used for communication with all cohorts to ensure the reasonableness of the harmonization. In addition, psHarmonize provides a summary report based only on the harmonized values. This facilitated the description of the full harmonized data resource to the DRPP team (Figures 4 and 5).
Figure 4.
Plot of harmonized weight data from 14 harmonized cohorts
This plot shows the harmonized weight variable in kg and the raw weight variables of various units. Unit conversion was required for the ARIC, CHS, all of the FHS cohorts, Multi-Ethnic Study of Atherosclerosis (MESA), and SALSA. Coded survey values were set to missing in the MESA and SALSA.
Figure 5.
Summary output of harmonized data taken from summary comparison report
Example cross-tabulation of original source values and harmonized values for education in the ARIC dataset. This type of R Markdown summary report is automatically generated by the psHarmonize R package.
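A cross-tabulation of this kind can be reproduced by hand with base R as a quick check. The example below uses invented source codes and a hypothetical lookup; it is not the code that generates the package's report.

```r
# Invented ARIC-style source codes, including one missing value.
source_value <- c(1, 2, 3, 3, 4, 5, 6, NA)

# Hypothetical lookup from source codes to harmonized categories.
lookup <- c(
  "1" = "no school/grade school",
  "2" = "high school",
  "3" = "high school",
  "4" = "technical/vocational/college/graduate/professional",
  "5" = "technical/vocational/college/graduate/professional",
  "6" = "technical/vocational/college/graduate/professional"
)
harmonized_value <- unname(lookup[as.character(source_value)])

# Rows are source codes, columns are harmonized categories.
xtab <- table(source = source_value, harmonized = harmonized_value, useNA = "ifany")

# Each source code should fall into exactly one harmonized category.
one_to_one_or_many_to_one <- all(rowSums(xtab > 0) == 1)
```

A source code spread across more than one harmonized column would indicate an inconsistent mapping worth reviewing with the cohort.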
Independent harmonization by external groups
psHarmonize is structured to facilitate independent harmonization by external groups when data are not centrally available. For the DRPP, the international Rotterdam Study and the Three-City Study were unable to contribute raw data to the central data team. In order to facilitate the harmonization of their local data to DRPP-harmonized variables, the psHarmonize package was sent to both cohorts, along with an example harmonization sheet and with DRPP-harmonized values completed to the extent possible on the harmonization sheet. These two external cohorts were able to fill in and create their own harmonization sheet that allowed them to convert their data into a version that aligned with the DRPP’s local harmonized data. The fields they were required to enter included source dataset names, source variable names, and the raw values that needed to be converted to the harmonization values. For example, for the external cohorts to harmonize education values to “no school/grade school,” “high school,” “technical/vocational/college/graduate/professional,” we pre-filled the harmonization sheet with “= no school/grade school; = high school; = technical/vocational/college/graduate/professional” in the "code1" column, and “recode” in the "code_type" column. The external cohorts could simply enter the values that correspond with these values. This might look like “1 = no school/grade school; 2 = high school; 3 = technical/vocational/college/graduate/professional.” The Rotterdam and Three-City groups were also asked to fill in the columns that were used for documentation, such as “coding_notes” and “notes.” After both cohorts harmonized their data, they used psHarmonize locally to create the summary R Markdown reports and send them to the DRPP central data team for review. This allowed for a quality assessment of their harmonization and the opportunity to request any needed changes after review by the DRPP team.
DRPP data in use
The resulting DRPP-harmonized dataset is now being used in downstream analyses. For example, the DRPP research team created a dashboard that allows users to interactively create visualizations to explore the harmonized data. The long and wide datasets allow the user to quickly graph and group variables longitudinally (with the long dataset) and create frequencies at the patient level (with the wide dataset). The data are also being used to create new and validate existing dementia risk prediction models. New models are being developed using penalized competing risk survival methods, as well as machine learning techniques. Harmonized DRPP data are made available to approved users by submitting a request to the DRPP research team by visiting https://drpp.northwestern.edu/research/.
Discussion
Data harmonization is applicable in health research, as well as a variety of other fields such as environmental science,28 the oil industry,29 social science,30 and psychology.31 The psHarmonize workflow and R package are demonstrated here for the DRPP project, but the procedures and functionality would be useful in any field that requires pooling multiple datasets together to create a harmonized dataset. The harmonization() function carries out all harmonization according to the instructions provided, so the full workflow can be run with concise R code.
Pre-statistical harmonization of multiple datasets can require copious coding and can easily result in disjointed scripts that may not always guarantee reproducibility or result in a clean track record of research team decision-making. The overall workflow and computational approach developed by our team ultimately centers on a human-readable file of harmonization instructions, either in .xlsx format or as an R data frame, that can be reproducibly executed in R without lengthy scripting. The user creates this harmonization sheet that drives the data processing, and all functionality is built around a central user-specified set of harmonization instructions. psHarmonize addresses multiple organizational and coding challenges in a unified framework: (1) rather than manually renaming and concatenating multiple datasets, users specify source dataset names, variable names and, if desired, organizational variable domain names and map to harmonized values with a corresponding desired variable name without disrupting the integrity of the original data source; (2) the “recode category” option in the harmonization sheet allows users to specify cohort-specific mappings to common values in one central, human-readable file; (3) conflicting units for continuous measurements can easily be converted using the “function” option in the harmonization sheet; (4) more complicated data conversions are achievable using the “function” option that lets the user enter R functions requiring multiple input variables; (5) the harmonization sheet serves as organically developed documentation that keeps track of which raw variable(s) were used to create the harmonized variables for each cohort; (6) harmonized data can be extracted in long (visits stored in rows) or wide (visits stored in columns) format; and (7) after a harmonized dataset is created, reproducible, tabbed reports allow users to disseminate summary statistics of the harmonized dataset. 
This allows others on the team to quickly see what has been harmonized and make checks regarding the accuracy of the harmonization.
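The difference between the long and wide layouts mentioned in point (6) can be illustrated with base R. The data and column names below are invented, and the naming of wide columns here follows `reshape()` rather than whatever convention the package itself uses.

```r
# Invented long-format data: one row per participant per visit.
long <- data.frame(
  study = "ARIC",
  id = c(1, 1, 2, 2),
  visit = c(1, 2, 1, 2),
  weight = c(70, 72, 81, 79)
)

# Wide format: one row per participant, one weight column per visit.
wide <- reshape(
  long,
  idvar = c("study", "id"),
  timevar = "visit",
  direction = "wide",
  sep = "_"
)

# The long layout lends itself to per-visit summaries.
mean_by_visit <- aggregate(weight ~ visit, data = long, FUN = mean)
```

The long layout suits longitudinal plots and grouped summaries, while the wide layout gives one record per participant for patient-level frequencies.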
For the Rotterdam Study and Three-City Study that will serve only as external validation cohorts in the DRPP, and thus do not intend to transfer source data to the project, the psHarmonize workflow was particularly useful. The harmonization sheet provided efficient, structured harmonization guidance that could be partially completed by the DRPP team to ensure the central coordination of data for both the internal and external cohorts. psHarmonize has also proven transferable to independent groups navigating harmonization of real-world data. In addition to being a resource for the DRPP parent project, an independent research team used psHarmonize to harmonize data from the UK Biobank32,33 to align with DRPP-variable definitions. The psHarmonize R package was also used to harmonize data for the Collaborative Cohort of Cohorts for COVID-19 Research (C4R).34
The psHarmonize R package has some limitations. (1) Complicated or lengthy programming tasks may prove difficult to read in the harmonization sheet. (2) Some programming tasks, such as aggregating data to the visit level for complex longitudinal datasets, may need within-study data pre-processing before harmonization is performed. (3) Harmonization functionality currently requires input datasets to be flat files (.csv files, for example) in the wide format (one row per participant). (4) While our package can detect “processing” errors (such as a source dataset or item not present), it cannot detect “content” errors. For example, if a user provided a recode or function code that created values that did not make logical or analytic sense, but the recode or function was still executable given the inputs, then the package would not be able to detect this type of error. (5) High-dimensional wearables, images, and genomic data are not currently supported.
Alignment of large-scale data resources across multiple studies promises robustness of analytic findings and, for health research, the possibility for improvement in both clinical care and public health. The burgeoning collection of sophisticated statistical methods and machine learning tools for multi-cohort analysis holds promise only if applied to rigorously prepared data sources. Pre-statistical harmonization, typically the very first step to pool data resources for joint analysis, thus demands precision, cohesion, reproducibility, documentation, and clean visualization for review by research teams. Our harmonization workflow that is supported by the psHarmonize R package accomplishes all of these critical steps and holds tremendous promise for streamlined pre-statistical data harmonization in multiple research settings.
Experimental procedures
Resource availability
Lead contact
Requests for further information should be directed to and will be fulfilled by the lead contact, John Stephen (john.stephen@northwestern.edu).
Data and code availability
Harmonized DRPP data are made available to approved users by request at https://drpp.northwestern.edu/research/. To request access to the data, users should click on the link called the “interest form,” which will open a web form that asks for contact information, the research question, and specific variables of interest. Once the form is submitted, the DRPP executive committee will review the data request and, if it is approved, facilitate formal data access procedures. The psHarmonize R package is archived on Zenodo at https://doi.org/10.5281/zenodo.11122885 (reference 2) and is available on GitHub at https://github.com/NUDACC/psHarmonize.
Data inventory, meta-data, and decision-making
Raw data in the demographic, clinical, behavioral, genetic risk factor, and clinical outcome domains were received from 14 of the 16 DRPP cohorts. Variables were inventoried, and a meta-data data frame (Figure 2) was created in R with a row for every variable received and columns for dataset names, variable labels as applicable, the visit number for longitudinal measurements, and the cohort to which the variable belongs. These meta-data facilitated straightforward searching through all DRPP data for the identification of variables across cohorts. Extensive review of cohort documentation and direct contact with cohort data managers and investigators were critical to complete the inventory.
Descriptive statistics of the raw data were generated for each cohort and were critical for decision-making for harmonization. For continuous variables, boxplots and histograms were generated, and common units of measurement were decided upon by the DRPP investigator team and content experts. For categorical variables, individual tables of frequencies and counts were reviewed across cohorts and, for most variables (e.g., race, ethnicity, education, diet, smoking, alcohol use), revealed differences in coded categories. Final harmonized categories to which cohort-specific categories could be mapped were decided by the DRPP investigators, with critical input on variable interpretation and ascertainment methods from each cohort team.
Two of the DRPP cohorts, the Rotterdam Study and the Three-City Study, did not contribute raw data but are included in the DRPP to serve as validation cohorts for risk prediction models. These two external cohorts performed their own data inventory and confirmed their agreement with the proposed continuous variable transformations and proposed harmonized categories.
Harmonization sheet and harmonization() function
The primary R function in psHarmonize, harmonization(), takes as input a harmonization sheet, either imported from .xlsx format or specified as an R data frame (Figure 3). The structured harmonization sheet catalogs variable names and, if relevant, domains (e.g., clinical, behavioral, outcomes) in the source dataset, provides R code instructions for the systematic mapping or conversion of source variables to harmonized variables, specifies the variable name to be used in the harmonized dataset, and tracks notes that are relevant for the harmonization process. Multiple cohorts can be tracked in this harmonization sheet, thus maintaining a central harmonization pipeline for multiple source datasets. psHarmonize facilitates mappings of categories from a source dataset to common categories, including many-to-one mappings (Figure 3). psHarmonize also facilitates both categorical and continuous data transformations requiring combinations of multiple source variables.
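As an illustration, a harmonization sheet specified directly as an R data frame might look like the sketch below. The column names are those mentioned in this article; the cohort names, source variables, and category labels are hypothetical, and the sheet shown in Figure 3 may contain additional columns (e.g., domains and notes).

```r
# Hypothetical harmonization sheet: two cohorts, two harmonized variables.
harmonization_sheet <- data.frame(
  study = c("Cohort A", "Cohort B", "Cohort A", "Cohort B"),
  visit = c(1, 1, 1, 1),
  item = c("height_cm", "height_cm", "smoker", "smoker"),
  source_dataset = c("cohort_a", "cohort_b", "cohort_a", "cohort_b"),
  source_item = c("ht_in", "height", "smk", "smoke_status"),
  id_var = c("id", "pid", "id", "pid"),
  # "function" applies R code to x; "recode" maps old values to new values;
  # NA means the source variable is carried over without modification.
  code_type = c("function", NA, "recode", "recode"),
  code1 = c(
    "x * 2.54",                               # inches to centimeters
    NA,                                       # already in centimeters
    "0 = No; 1 = Yes",
    "never = No; former = No; current = Yes"  # many-to-one mapping
  ),
  possible_range = c("[50, 250]", "[50, 250]", NA, NA),
  stringsAsFactors = FALSE
)
```

Each row describes how one source variable from one cohort and visit maps to one harmonized variable, so the whole set of decisions can be reviewed in one place.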
When the harmonization() function is called, the harmonization sheet is first checked to make sure it is in the correct format and has the required columns. For example, the function checks that the sheet has one row per combination of “study” (cohort), “visit,” and “item” (harmonized variable). The function also checks whether the user specified a “code_type” whenever code is provided in the “code1” column.
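The kinds of checks described here can be sketched in base R as follows. The helper check_sheet() is hypothetical and is not the package's internal implementation; it only mirrors the checks named in the text (required columns, one row per study/visit/item combination, and a code type wherever code is given).

```r
# Hypothetical validator mirroring the checks described in the text.
check_sheet <- function(sheet) {
  required <- c("study", "visit", "item", "source_dataset",
                "source_item", "code_type", "code1")
  missing_cols <- setdiff(required, names(sheet))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  # One row per combination of study, visit, and item.
  key <- paste(sheet$study, sheet$visit, sheet$item, sep = "|")
  if (anyDuplicated(key) > 0) {
    problems <- c(problems,
                  paste("duplicate row for:", unique(key[duplicated(key)])))
  }
  # Code in code1 requires a code_type.
  has_code <- !is.na(sheet$code1) & nzchar(sheet$code1)
  no_type <- is.na(sheet$code_type) | !nzchar(sheet$code_type)
  if (any(has_code & no_type)) {
    problems <- c(problems,
                  paste("code1 without code_type in row", which(has_code & no_type)))
  }
  problems
}

# Example sheet with one duplicated combination and one missing code_type.
sheet <- data.frame(
  study = c("Cohort A", "Cohort A", "Cohort B"),
  visit = c(1, 1, 1),
  item = c("height_cm", "height_cm", "height_cm"),
  source_dataset = c("cohort_a", "cohort_a", "cohort_b"),
  source_item = c("ht_in", "ht_in2", "height"),
  code_type = c("function", "function", NA),
  code1 = c("x * 2.54", "x * 2.54", "x * 100"),
  stringsAsFactors = FALSE
)
problems <- check_sheet(sheet)
```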
Creating a shell
Once checks of the harmonization sheet are complete, the harmonization() function then prepares a long dataset “shell” of all the study participants and visits from all cohorts specified in the harmonization sheet. It does this by taking each of the datasets in the R global environment that are listed in the “source_dataset” column, taking their unique list of IDs, assigning the visit number that the user provided in “visit,” and the study name provided in “study.” These are then bound as rows to create a long dataset of study names, visits, and IDs. As harmonized variables are created, they are then joined into this data frame.
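A minimal sketch of building such a shell in base R is shown below. For self-containment, the source datasets are held in a named list rather than looked up in the global environment, and all dataset and variable names are hypothetical.

```r
# Hypothetical source datasets (a named list stands in for the global environment).
sources <- list(
  cohort_a_v1 = data.frame(id = c(1, 2, 2, 3), sbp = c(120, 130, 131, 110)),
  cohort_a_v2 = data.frame(id = c(1, 3), sbp = c(125, 115)),
  cohort_b_v1 = data.frame(pid = c("x1", "x2"), systolic = c(140, 118))
)

# Only the sheet columns needed to build the shell.
sheet <- data.frame(
  study = c("Cohort A", "Cohort A", "Cohort B"),
  visit = c(1, 2, 1),
  source_dataset = c("cohort_a_v1", "cohort_a_v2", "cohort_b_v1"),
  id_var = c("id", "id", "pid"),
  stringsAsFactors = FALSE
)

# For each sheet row: unique IDs from the source dataset, tagged with study and visit.
shell_parts <- lapply(seq_len(nrow(sheet)), function(i) {
  ids <- unique(sources[[sheet$source_dataset[i]]][[sheet$id_var[i]]])
  data.frame(
    cohort = sheet$study[i],
    visit = sheet$visit[i],
    ID = as.character(ids),  # common type so IDs from all cohorts can be stacked
    stringsAsFactors = FALSE
  )
})

# Bind as rows; unique() guards against several sheet rows pointing at the same dataset.
shell <- unique(do.call(rbind, shell_parts))
```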
Error log
Next, an error log is created based on the harmonized item instructions from the harmonization sheet. This error log is a data frame that will populate with either “completed” or “not completed” as harmonized variables are being created.
Creating a long version of each variable
The harmonization() function loops through the potential harmonized variables listed in the harmonization sheet (column “item”). A subfunction called create_long_dataset() is called for each potential harmonized variable. This function then loops through every row of the harmonization sheet that corresponds to the harmonized variable of interest. The function first checks whether the “source_dataset” and “source_item” of the current row exist in R’s global environment. If either does not exist, the function records “not completed” for this row along with the reason, and the loop advances to the next row.
The create_long_dataset() function then loads the corresponding “source_dataset” from R’s global environment. The function renames the “source_item” variable to the “item” value the user provided, renames the “id_var” to “ID,” adds the visit number to the data frame, and adds the cohort’s name to the data frame. Depending on what is entered in “code1” and “code_type,” the function then leaves the “source_item” variable unmodified, recodes its categorical values, or applies a function to it.
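The existence check and its interaction with the error log can be sketched as follows. This is an illustration under stated assumptions, not the package's code: a dedicated environment stands in for the global environment, the log column names “status” and “reason” are hypothetical, the source variable is looked up inside its source dataset, and “completed” is recorded right after the checks, whereas the package records it once the variable has actually been harmonized.

```r
# A dedicated environment stands in for R's global environment.
env <- list2env(list(
  cohort_a = data.frame(id = 1:3, ht_in = c(65, 70, 68))
))

sheet <- data.frame(
  study = c("Cohort A", "Cohort A", "Cohort B"),
  visit = 1,
  item = c("height_cm", "weight_kg", "height_cm"),
  source_dataset = c("cohort_a", "cohort_a", "cohort_b"),
  source_item = c("ht_in", "wt_lb", "height"),
  stringsAsFactors = FALSE
)

# Error log: one row per sheet row, filled in as the loop proceeds.
error_log <- sheet[, c("study", "visit", "item")]
error_log$status <- NA_character_
error_log$reason <- NA_character_

for (i in seq_len(nrow(sheet))) {
  ds <- sheet$source_dataset[i]
  if (!exists(ds, envir = env, inherits = FALSE)) {
    error_log$status[i] <- "not completed"
    error_log$reason[i] <- paste0("source dataset '", ds, "' not found")
    next
  }
  # source_item may list several variables separated by semicolons.
  vars <- trimws(strsplit(sheet$source_item[i], ";", fixed = TRUE)[[1]])
  missing_vars <- setdiff(vars, names(get(ds, envir = env)))
  if (length(missing_vars) > 0) {
    error_log$status[i] <- "not completed"
    error_log$reason[i] <- paste0("source item not found: ",
                                  paste(missing_vars, collapse = ", "))
    next
  }
  error_log$status[i] <- "completed"
}
```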
If the user specifies “recode” in the “code_type” column, then the “code1” column is expected to be in the format of “old_val1 = new_val1; old_val2 = new_val2,” etc. Then, “code_modify_recode()” makes the appropriate recoding on the input variable if the “old_val” is present in the coding instructions. If a value is present in the raw data and no corresponding recode instruction is provided, then the original value remains.
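The recode behavior described above, including the rule that unmapped values are left unchanged, can be sketched in base R. The helper recode_values() is hypothetical and is not the package's code_modify_recode(); it only illustrates parsing an instruction string of the form “old_val1 = new_val1; old_val2 = new_val2.”

```r
# Hypothetical helper: recode x according to "old = new; old = new" instructions.
recode_values <- function(x, instructions) {
  # Split into "old = new" pairs, then split each pair on "=".
  pairs <- strsplit(trimws(strsplit(instructions, ";", fixed = TRUE)[[1]]),
                    "=", fixed = TRUE)
  old_vals <- trimws(vapply(pairs, `[`, character(1), 1))
  new_vals <- trimws(vapply(pairs, `[`, character(1), 2))
  x_chr <- as.character(x)
  idx <- match(x_chr, old_vals)
  # Values without a recode instruction keep their original value.
  ifelse(is.na(idx), x_chr, new_vals[idx])
}

smoking_raw <- c(1, 2, 3, 9, NA)
smoking_harmonized <- recode_values(smoking_raw, "1 = Never; 2 = Former; 3 = Current")
# The value 9 has no instruction and is returned as "9"; NA stays NA.
```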
If the user specifies “function” in the “code_type” column, then the “code1” column is expected to be text that is a valid function, with “x” in the function representing the “source_item.” For example, if a source dataset has height in inches, and the desired harmonization is height in cm, then the user could enter “x * 2.54” in “code1.”
Multiple input variables are allowed provided they are in the same source dataset. For example, BMI could be calculated by the user entering “x1/(x2**2)” in “code1” and “weight; height” in “source_item” (assuming weight in kg and height in m).
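One way such code text could be evaluated is sketched below. The helper apply_code() is hypothetical and is not the package's implementation; it assumes a single source variable is exposed as x and several semicolon-separated source variables are exposed as x1, x2, and so on, in the order listed.

```r
# Hypothetical helper: evaluate code1 text against one or more source variables.
apply_code <- function(data, source_item, code1) {
  vars <- trimws(strsplit(source_item, ";", fixed = TRUE)[[1]])
  inputs <- lapply(vars, function(v) data[[v]])
  # A single input is named "x"; multiple inputs are named "x1", "x2", ...
  names(inputs) <- if (length(vars) == 1) "x" else paste0("x", seq_along(vars))
  # Evaluate the text with the inputs visible and only base R functions behind them.
  eval(parse(text = code1), envir = inputs, enclos = baseenv())
}

# Height in inches converted to centimeters.
heights <- data.frame(ht_in = c(60, 70))
height_cm <- apply_code(heights, "ht_in", "x * 2.54")

# BMI from weight (kg) and height (m).
body <- data.frame(weight = c(70, 90), height = c(1.75, 1.80))
bmi <- apply_code(body, "weight; height", "x1 / (x2^2)")
```

Because the text in the sheet is executed as R code, a sheet received from an outside source should be reviewed before it is run.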
The harmonization sheet also allows the user to specify a possible range for the final harmonized variables (for continuous variables). The user can specify a range in “possible_range” such as “[0, 100)” (0 up to but not including 100), and harmonized values outside of that range will be set to NA. Similarly, for a categorical variable, the user can specify a set of allowable harmonized values and indicate that all other values of the source variable should be set to NA if they are not part of the specified mappings. Any values that are set to NA under either of these circumstances are reported in the error log.
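The interval notation can be interpreted as in the following sketch, where a square bracket includes the endpoint and a parenthesis excludes it. The helper apply_possible_range() is hypothetical and only illustrates the behavior described in the text.

```r
# Hypothetical helper: set values outside an interval such as "[0, 100)" to NA.
apply_possible_range <- function(values, possible_range) {
  pr <- trimws(possible_range)
  open <- substr(pr, 1, 1)
  close <- substr(pr, nchar(pr), nchar(pr))
  bounds <- as.numeric(strsplit(substr(pr, 2, nchar(pr) - 1), ",", fixed = TRUE)[[1]])
  stopifnot(open %in% c("[", "("), close %in% c("]", ")"),
            length(bounds) == 2, !anyNA(bounds))
  # "[" and "]" include the endpoint; "(" and ")" exclude it.
  ok_lower <- if (open == "[") values >= bounds[1] else values > bounds[1]
  ok_upper <- if (close == "]") values <= bounds[2] else values < bounds[2]
  out_of_range <- !is.na(values) & !(ok_lower & ok_upper)
  values[out_of_range] <- NA
  # Return the count as well so it can be reported in an error log.
  list(values = values, n_set_to_na = sum(out_of_range))
}

result <- apply_possible_range(c(-5, 0, 50, 100, NA), "[0, 100)")
# -5 and 100 fall outside [0, 100) and are set to NA; the original NA is not counted.
```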
After the appropriate modification has occurred for the current harmonization sheet row, the data are added to a temporary dataset that accumulates all values for the current variable that is being harmonized. Once all values are harmonized for the current variable, the intermediate dataset is then merged onto the cohort shell that was created at the start (merged on "cohort", "visit", and "ID").
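The merge onto the shell behaves like a left join, so participants without a value for the current variable remain in the dataset with NA. A minimal base R sketch with hypothetical data:

```r
# Shell of all participants and visits (hypothetical).
shell <- data.frame(
  cohort = c("Cohort A", "Cohort A", "Cohort B"),
  visit = c(1, 1, 1),
  ID = c("1", "2", "x1"),
  stringsAsFactors = FALSE
)

# Accumulated long dataset for one harmonized variable; participant "2" has no value.
height_long <- data.frame(
  cohort = c("Cohort A", "Cohort B"),
  visit = c(1, 1),
  ID = c("1", "x1"),
  height_cm = c(165.1, 180.3),
  stringsAsFactors = FALSE
)

# Left join on cohort, visit, and ID keeps every row of the shell.
harmonized <- merge(shell, height_long, by = c("cohort", "visit", "ID"), all.x = TRUE)
```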
Once all of the variables are harmonized (all of the rows in the harmonization sheet are processed), a harmonization object is returned. This is a list with a long version of the harmonized dataset, a wide version of the dataset, the error log, and the harmonization sheet.
Long and wide version of dataset
The harmonized dataset is initially constructed as a long dataset. Each row represents a visit. A wide version of the same harmonized dataset is also constructed. In this case, the visit number is appended to the end of the variable name. Each row in the wide dataset represents an individual, and data from each visit are represented in multiple columns. Both the long- and wide-format datasets are included in the returned harmonization object.
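The relationship between the two formats can be illustrated with base R's reshape(). The data are hypothetical, and the underscore separator between the variable name and the visit number is an assumption; the package's own naming may differ.

```r
# Long format: one row per participant visit.
long <- data.frame(
  cohort = "Cohort A",
  ID = c("1", "1", "2"),
  visit = c(1, 2, 1),
  sbp = c(120, 125, 135),
  smoker = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Wide format: one row per participant, visit number appended to each variable name.
wide <- reshape(
  long,
  idvar = c("cohort", "ID"),
  timevar = "visit",
  direction = "wide",
  sep = "_"
)
# Participant "2" has no visit 2, so sbp_2 and smoker_2 are NA for that row.
```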
Harmonization from multiple cohorts
The harmonization() function allows for data compilation from multiple cohorts. There is a field in the harmonization sheet called “study” where the user can enter the study’s name. The harmonized dataset will have a column called “cohort” that will store the cohort’s name associated with the data. This makes it easy to subset the harmonized data and/or stratify analyses if desired.
Descriptive summaries
After the data are processed and harmonized, the psHarmonize R package can create output describing the data (Figure 5). The user can create an error log that details which variables were successfully harmonized, which ones were not, and for what reason. The package also has a function that creates R Markdown reports including descriptive summaries of the final harmonized variables, as well as a report that compares the raw values from the source datasets to the harmonized values. The comparison report allows users to review the specific data transformations that took place to ensure the data were harmonized correctly. The report without comparisons allows the user to create a document that may be most useful for dissemination to others.
The R Markdown files group the summaries by cohort and then within cohorts by the harmonized variables. These categories are displayed within tabs in the R Markdown report. This allows the user to quickly switch between cohort and harmonized variables. Categorical variables are summarized using frequencies and bar charts. Continuous variables are summarized using numeric summaries (mean, min, median, max, etc.) as well as either a histogram (for one visit) or boxplots (for multiple visits).
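The type-dependent summaries can be approximated in base R as shown below. This sketch does not reproduce the package's report functions; summarize_variable() is a hypothetical helper and the data are invented for illustration.

```r
# Hypothetical harmonized data for two cohorts.
harmonized <- data.frame(
  cohort = c("Cohort A", "Cohort A", "Cohort A", "Cohort B", "Cohort B"),
  sbp = c(120, 135, NA, 128, 142),
  smoker = c("No", "Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Numeric summaries for continuous variables, frequencies for categorical ones.
summarize_variable <- function(x) {
  if (is.numeric(x)) {
    c(n = sum(!is.na(x)),
      mean = mean(x, na.rm = TRUE),
      min = min(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      max = max(x, na.rm = TRUE))
  } else {
    table(x, useNA = "ifany")
  }
}

# Group by cohort first, then summarize each harmonized variable within cohort.
sbp_by_cohort <- lapply(split(harmonized$sbp, harmonized$cohort), summarize_variable)
smoker_by_cohort <- lapply(split(harmonized$smoker, harmonized$cohort), summarize_variable)
```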
Acknowledgments
This work was supported by R61NS120245 from the National Institute of Neurological Disorders and Stroke.
Author contributions
Conceptualization, J.J.S., N.B.A., and D.M.S.; methodology, J.J.S., N.B.A., and D.M.S.; software, J.J.S. and M.M.; data curation, A.E.K., J.J.S., and P.C.; writing – original draft, J.J.S.; writing – review & editing, J.J.S., P.C., A.E.K., S.S., M.M., N.B.A., and D.M.S.; visualization, J.J.S.; supervision, N.B.A. and D.M.S.; funding acquisition, N.B.A.
Declaration of interests
N.B.A. receives funding from the National Institutes of Health and the American Heart Association. D.M.S. receives funding from the National Institutes of Health.
Published: June 14, 2024
Contributor Information
John J. Stephen, Email: john.stephen@northwestern.edu.
Denise M. Scholtens, Email: dscholtens@northwestern.edu.