Evaluating the harmonisation potential of diverse cohort datasets

Sarah Bauermeister; Mukta Phatak; Kelly Sparks; Lana Sargent; Michael Griswold; Caitlin McHugh; Mike Nalls; Simon Young; Joshua Bauermeister; Paul Elliott; Andrew Steptoe; David Porteous; Carole Dufouil; John Gallacher

doi:10.1007/s10654-023-00997-3

. 2023 Apr 26;38(6):605–615. doi: 10.1007/s10654-023-00997-3

Evaluating the harmonisation potential of diverse cohort datasets

Sarah Bauermeister ^1,^✉, Mukta Phatak ², Kelly Sparks ³, Lana Sargent ⁴, Michael Griswold ⁵, Caitlin McHugh ², Mike Nalls ⁶, Simon Young ¹, Joshua Bauermeister ¹, Paul Elliott ⁷, Andrew Steptoe ⁸, David Porteous ⁹, Carole Dufouil ¹⁰, John Gallacher ¹

PMCID: PMC10232583 PMID: 37099244

Abstract

Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets is data harmonisation.A set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts.Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal a loss of granularity.Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10654-023-00997-3.

Keywords: Data harmonisation, Cohort, Data visualisation, C-surv data model, Data discovery, Datasets

Background

Data discovery, the ability to find data assets relevant to an analysis, is a critical component of a productive research environment. Efficient data discovery increases opportunity, improves rigour, and accelerates activity. For cohort studies, the complexity and variety of longitudinal data presents particular discovery challenges as data structures and labelling conventions are highly variable and typically under-documented. The growth of data repositories [1–5] providing global 3rd party researcher access to multiple datasets, increases the value of efficient data discovery.

A pre-requisite for efficient data discovery is harmonisation. The goal of harmonisation is to achieve comparability (inferential equivalence) between two or more variables by inferring a latent construct that the variables are considered to represent. For example, different reaction time tasks may be judged to represent a latent construct of cognitive processing speed. However, latent constructs are purpose-specific, as a latent construct of processing speed based on simple reaction time may be inadequate for a hypothesis based on choice reaction time. Harmonisation is easily conflated with standardisation where data are organised (structured and annotated) according to a standard model. Clearly, standardisation is a step towards harmonisation, but they are not coterminous. Standardisation is about conformability whilst harmonisation is about comparability.

Previous exercises in the harmonisation of cohort data have focussed on hypothesis-specific testing across multiple datasets. The Maelstrom consortium has adopted a different approach by developing multiple harmonised datasets available for 3rd party use [6]. The Integrative Analysis of Longitudinal Studies of Aging (IALSA) consortium also supports a broad range of multi-cohort analyses; making the harmonisation code available for many of them [7]. The CLOSER consortium provides an extensive programme of topic-specific harmonisation initiatives across 19 cohorts [8], designed to test a broad range of hypotheses. Other initiatives are more focussed. Adhikari et al. describe harmonising 20 variables from two pregnancy cohorts to better understand risk factors for pre-term birth [9], whilst Almeida et al. describe and validate a pipeline for neurodegeneration analyses, also using two population cohorts [10]. Further larger harmonisation collaborative projects include the Research Advancement through Cohort Cataloguing and Harmonization (ReACH) [11], the EU Child Cohort Network [12] and the Melbourne Children’s LifeCourse (LifeCourse) [13] initiatives.

Here we focus on harmonisation for data discovery. Focussing on data discovery requires fewer assumptions about likely future use of the harmonised data, enabling the use of more generic latent constructs. It is also sympathetic to relatively simple harmonisation rules. Value lies in enabling the availability of relevant data from multiple datasets, to be assessed accurately and efficiently, prior to an access request. This reduces search costs for data discovery and reduces the risk of accessing and processing uninformative data. The work was conducted within the Dementias Platform UK Data Portal (DPUK) [1]. This is an integrated research environment for bona fide researchers interested in accessing cohort data for secondary analysis. The Portal provides tools for data discovery, an access management system and a virtual desktop interface (VDI) for analysis. The VDI provides a virtual desktop with preinstalled statistical programmes (e.g., Stata, Python, SPSS, R, MATLAB) for analysing complex multi-modal datasets. The data discovery tools (Cohort Matrix, Cohort Directory, Cohort Explorer) provide researchers with the ability to investigate appropriate cohorts at increasing levels of granularity. The Cohort Explorer (https://portal.dementiasplatform.uk/CohortExplorer) is premised on the harmonisation of a select number of 30 variables focused on neurodegeneration.

To evaluate the harmonisation potential of population cohort data for data discovery, members of the Alzheimer’s Disease Data Initiative (ADDI) [4] and DPUK formed a Data Harmonisation Group, to attempted the harmonisation of a comprehensive set (124 variables) of neurodegeneration related variables, across four diverse population cohort datasets. The aim of this project is to inform the development of discovery tools across the ADDI Workbench and the DPUK Data Portal.

Methods

Variable selection

A set of 124 variables optimised for neurodegeneration was identified by consensus within the ADDI Data Harmonisation group. Variables were selected to reflect the frequency of being requested in Dementias Platform UK (DPUK) data access proposals [1], to cover a range of data modalities, and to include modifiable and non-modifiable risk factors.

Standardisation

Datasets were curated to a common structure and labelling conventions using C-Surv as the data model [14]. C-Surv is a simple four level acyclic taxonomy intended to capture the breadth of data typically collected in research cohorts. The tiered structure supports grouped and individual variable discovery. C-Surv comprises 18 data themes (level 1) leading to > 146 data ‘domains’ (level 2), > 500 data ‘families’ (level 3) and then to a growing number of data ‘objects’ (level 4) i.e. the measured variable. C-Surv has been adopted by DPUK [1], Dementias Platform Australia [3], and the ADDI workbench [4]. Other models, developed for other purposes were available, such as the Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model for administrative health data [15], and CDISC Clinical Data Interchange Standards Consortium (CDISC) for trials data [16], but these have structural and semantic complexity that is alien to the cohort study design.

Schema development

The harmonisation schema was optimised to be inclusive of datasets by using relatively simple harmonisation rules and widely used value-labelling conventions. Three strategies for harmonisation, as described in the Maelstrom harmonisation guidelines [17] were used.

Simple calibration, using direct mapping between the source variable and the harmonised variable, was adopted for widely used standard metrics such as weight or height. Direct mapping, including cut-off points was used for validated clinical scales. The Gregorian calendar was used for dates and conventional units were used for age (years), durations (hours), concentrations (mg/ml), volumes (mm³), etc.

Algorithmic transformation was used for non-clinical questionnaire responses including lifestyle. The algorithm was selected to be inclusive by using a relatively simple transformation and was developed iteratively as it was applied to each dataset. Gender was transformed as male, female; smoking as ‘current, past, and never’, and ethnicity as white, black, Asian, mixed, other. Cohabitation was coded as single, married/cohabiting, divorced/separated, widowed, whilst education was considered as educational experience and transformed into junior or less, secondary, degree or equivalent, postgraduate or equivalent. For type of accommodation a straightforward transformation was house/bungalow, apartment, sheltered/residential, other.

Non-clinical cognitive performance scores were standardised into z-scores by default, with an option for refining this rule on a scale-by-scale basis according to the variable distribution. More sophisticated methods such as latent variable modelling or multiple imputation were not used.

Schema evaluation

The utility of the harmonisation rules was tested using four DPUK collaborating cohorts. These were selected on the basis of having diverse primary scientific objectives, providing longitudinal multimodal data, and being frequently requested by DPUK users. The cohorts were the Airwave Health Monitoring Study (Airwave); an occupational cohort [18], the English Longitudinal Study of Ageing (ELSA); a social science focussed study [19], Generation Scotland; a genetics cohort [20], and Memento; a neurodegeneration cohort [21]. The coverage of each cohort and overlap of variables across cohorts was assessed, along with the utility of the harmonisation rules. All analyses were conducted within the DPUK Data Portal [22].

Results

Core variables

The variable list comprised a range of modifiable and non-modifiable risk factors and metadata (Table 1). Of the 124 variables, most variables (n = 103) were present in the baseline data. However, for ELSA data, 18 variables were collected in subsequent waves. For Memento, two outcomes were collected through linkage to health records. For both ELSA and Memento, genetics data are available independently of study wave. The variables covered 15 out of the 18 data themes represented by C-Surv data model (Table 1). Themes not represented were linkage data (theme 14), healthcare utilisation data (theme 15), and device data (theme 18).

Table 1.

Core variable list

#	C-Surv theme	Variable	Strategy	Harmonisation rule
1	Administration:theme 1	Cohort ID	SC	Anonymised by cohort
2		Assessment date	SC	Gregorian calendar (yyyy-mm-dd)
3		Date of birth	SC	Gregorian calendar (yyyy-mm-dd)
4		Date of death	SC	Gregorian calendar (yyyy-mm-dd)
5		Cause of death	SC: text	ICD-11 categories 1–18
6		DNA extracted	SC	1 Yes; 0 No
7		Plasma collected	SC	1 Yes; 0 No
8		Serum collected	SC	1 Yes; 0 No
9		CSF collected	SC	1 Yes; 0 No
10	Sociodemographic:theme 2	Age	SC	Value: years 1–130
11		Gender	SC	1 male; 2 female
12		Ethnicity	AT	1 white; 2 Black; 3 Asian; 4 mixed 5 other
13		Cohabitation	AT	1 single; 2 married/cohabiting; 3 separated/divorced;4 widowed other
14		Years education	SC	Value: years range
15		Educational level	AT	1 postgrad; 2 degree; 3 secondary; 4 junior or less
16		Income	AT	Quantiles using local currency
17	Early life experience:theme 3	Childhood physical abuse	SC	1 Yes; 0 No
18		Adolescent physical abuse	SC	1 Yes; 0 No
19		Sexual abuse	SC	1 Yes; 0 No
20		Parental smoking behaviour	SC	1 Yes; 0 No
21	Medical history:theme 4	Type 1 diabetes diagnosis	AT	1 Yes; 0 No
22		Type 2 diabetes diagnosis	AT	1 Yes; 0 No
23		AD diagnosis	AT	1 Yes; 0 No
24		AD FTD diagnosis	AT	1 Yes; 0 No
25		AD mixed diagnosis	AT	1 Yes; 0 No
26		VaD diagnosis	AT	1 Yes; 0 No
27		PD diagnosis	AT	1 Yes; 0 No
28		Depression diagnosis	AT	1 Yes; 0 No
29		Self-report visual difficulty	AT	1 Yes; 0 No
30		Self-report hearing difficulty	AT	1 Yes; 0 No
31		Angina diagnosis	AT	1 Yes; 0 No
32		MI diagnosis	AT	1 Yes; 0 No
33		Hypertension diagnosis	AT	1 Yes; 0 No
34		Stroke diagnosis	AT	1 Yes; 0 No
35		Head injury	AT	1 Yes; 0 No
36		COPD diagnosis	AT	1 Yes; 0 No
37		Arthritis diagnosis	AT	1 Yes; 0 No
38		Current pain	AT	1 Yes; 0 No
39		Self-report general health	AT	1 Yes; 0 No
40		Medications	SC: text	Value: number prescribed
41	Family disease history: theme 5	Dementia parent	SC	1 Yes; 0 No
42		Dementia grandparent	SC	1 Yes; 0 No
43		Dementia sibling	SC	1 Yes; 0 No
44		AD parent	SC	1 Yes; 0 No
45		AD grandparent	SC	1 Yes; 0 No
46		AD sibling	SC	1 Yes; 0 No
47		VaD parent	SC	1 Yes; 0 No
48		VaD grandparent	SC	1 Yes; 0 No
49		VaD sibling	SC	1 Yes; 0 No
50		PD parent	SC	1 Yes; 0 No
51		PD grandparent	SC	1 Yes; 0 No
52		PD sibling	SC	1 Yes; 0 No
53		CHD parent	SC	1 Yes; 0 No
54		CHD grandparent	SC	1 Yes; 0 No
55		CHD sibling	SC	1 Yes; 0 No
56		Stroke parent	SC	1 Yes; 0 No
57		Stroke grandparent	SC	1 Yes; 0 No
58		Stroke sibling	SC	1 Yes; 0 No
59	Psychological status: theme 6	GHQ score	AT	Scale score
60		Self-report depression	AT	1 Yes; 0 No
61		Loss of interest	AT	1 Yes; 0 No
62		Depression score	AT	Scale score
63		EPQ Neuroticism	AT	Scale score
64		EQP Extraversion	AT	Scale score
65		Life satisfaction score	AT	Scale score
66		Job satisfaction score	AT	Scale score
67		Quality of Life score	AT	Scale score
68		Loneliness scale score	AT	Scale score
69	Cognitive status:theme 7	Immediate recall score	S	Z score
70		Delayed recall score	S	Z score
71		Digit symbol substitution score	S	Z score
72		Verbal fluency score	S	Z score
73		Choice reaction time mSec	S	Z score
74		Fluid intelligence score	S	Z score
75		MMSE score	SC	Scale score
76		ADAS cog total score	SC	Scale score
77		CDR total score	SC	Scale score
78		Subjective memory complaint	AT	1 Yes; 0 No
79		MCI diagnosis	AT	1 Yes; 0 No
80	Lifestyle: theme 8	Alcohol consumption	AT	Alcohol units per week, other
81		Smoking status	AT	0 never smoked; 1 past smoker; 2 current
82		Vigorous exercise	AT	1 Yes; 0 No
83		Moderate exercise	AT	1 Yes; 0 No
84		Walking	AT	1 Yes; 0 No
85		Sleep quality scale	AT	Scale score
86		Sleep hours per night	SC	Hours per night
87	Life functionality: theme 9	ADL score	AT	Scale score (higher value higher independence)
88		IADL score	AT	Scale score (higher value higher functioning)
89	Physical environment: theme 10	Number of house occupants	SC	Value (occupants)
90		Number of rooms	SC	Value (rooms)
91		Type of accommodation	AT	1 house/bungalow, 2 apartment, 3 residential/sheltered/ other
92		Pollution (grime in house)	SC	1 Yes; 0 No
93	Social environment: theme 11	Number of contacts/month	SC	Value (number of social contacts)
94		Social media sites used	SC	Value (number of sites used)
95		Social media use daily	SC	Value (types used daily)
96	Physical examination: theme 12	Height	SC	Value (cm)
97		Weight	SC	Value (kg)
98		BMI	SC	Value (ratio m²/kg)
99		Grip strength	SC	Value (kg)
100		Gait (walking) speed	SC	Value (m/sec)
101		Systolic BP	SC	Value (mm/hg)
102		Diastolic BP	SC	Value (mm/hg)
103	Imaging: theme 13	White matter volume	SC	Value (mm³ standardised)
104		Grey matter volume	SC	Value (mm³ standardised)
105		Left hippocampal volume	SC	Value (mm³ standardised)
106		Right hippocampal volume	SC	Value (mm³ standardised)
107		WM hyperintensities	SC	Value (mm³ standardised)
108		Amyloid PiB SUVR	SC	Ratio
109	Biosample assays: theme 16	Haemoglobin	SC	Value (mg/dl)
110		White cell count	SC	Value (mg/dl)
111		RBC count	SC	Value (mg/dl)
112		Total cholesterol	SC	Value (mg/dl)
113		HDL cholesterol	SC	Value (mg/dl)
114		Creatinine	SC	Value (mg/dl)
115		Glucose	SC	Value (mg/dl)
116		CRP	SC	Value (mg/dl)
117		Cortisol decrease	SC	Value (mg/dl)
118		Abeta 1–42	SC	Value (pg/ml)
119		Abeta 1–40	SC	Value (pg/ml)
120		Abeta 1–42	SC	Value (pg/ml)
121		Abeta 1–40	SC	Value (pg/ml)
122		Total tau	SC	Value (pg/ml)
123		P tau	SC	Value (pg/ml)
124	Molecular: theme 17	APOE status	SC	1 2/2; 2 2/3; 3 2/4; 4 3/3; 5 3/4; 6 4/4)

Open in a new tab

SC Simple calibration, AT Algorithmic transformation, S Standardisation

C-Surv themes not represented: Linkage data (theme 14), Healthcare utilisation data (theme 15), and Device data (theme 18)

Representation and distribution

Most variables (n = 120; 97%) were found in one or more cohorts. Memento, being primarily designed to investigate neurodegeneration, included most variables (n = 92). The other cohorts, designed to address a broader range of questions had fewer neurodegeneration-focused variables (Table 2). Of the 4 variables that were not found in any cohort, one was related to air pollution (pm_2.5 concentration) and another was loneliness assessment. That Mild Cognitive Impairment (MCI) status was not available in any cohort reflects the difficulty of capturing these data in a population setting. That ADAS-Cog score was not available reflects the use of this scale primarily in trials than in cohorts.

Table 2.

Distribution of core variables across cohorts

Harmonised dataset		Number of variables per cohort
C-Surv theme	Variables included: n	Airwave	ELSA	Generation scotland	Memento
Administration	9	6	7	6	9
Sociodemographic	7	6	7	7	6
Early life environment	4	0	4	0	0
Medical history	20	10	13	10	18
Family disease history	18	0	0	15	12
Psychological status	10	3	7	3	5
Cognitive status	11	4	5	4	6
Lifestyle	7	3	7	2	6
Life functionality	2	0	2	0	2
Physical environment	4	1	2	3	2
Social environment	3	0	3	0	0
Physical examination	7	5	7	5	6
Imaging	6	0	0	0	6
Biosample assays	15	8	8	4	13
Molecular data	1	1	1	1	1
Totals	124	47	73	60	92

Open in a new tab

The distribution of variables across cohorts also varied, with 34 variables being common to all cohorts, 10 in three cohorts, 30 in two cohorts and 46 in one cohort (Fig. 1). This shows the diversity of the selected cohorts and reflects the range of scientific purpose underlying these datasets. For example, that ELSA and Memento include 13 and 26 unique variables respectively reflects the distinctive scientific foci of these studies; ELSA being focussed on social factors underlying ageing, and Memento focussed more specifically on neurodegeneration.

Fig. 1 — Distribution of core variables across cohorts

Utility

Of the 120 variables that were represented in one or more datasets, 61 (51%) were directly mapped. Direct mapping was generally straightforward but did involve truncation of dates, and the interpretation of text for primary cause of death and medications. For alcohol consumption, although using units per week is translatable with most datasets an ‘other’ option was allowed for when consumption was present but not quantifiable. The harmonisation rules for each variable are shown in the supplementary materials (table S1).

Fifty three (44%) variables were transformed by algorithm. For 44 of these, this was a matter of using a ‘yes/no’ (present/absent) format. For ELSA, the presence or absence of a medical condition was inferred from the data of diagnosis, or a symptom rating score. For Generation Scotland the presence or absence of angina and myocardial infarction was inferred from self-reported heart disease. For Memento several outcomes were indicated as present by interpreting a rating scale score or by a clinical diagnosis. The remaining nine variables required more interpretation (Table 3). For smoking, there was close concordance between the harmonisation rules and the raw data with some interpolation required for ELSA data. This could have been easily addressed by simplifying the scale to a binary ‘ever smoked: yes/no’ format, although this would have been less informative for most of the datasets. Ethnicity was missing in Memento as by law these data are not permitted to be collected in France. Harmonising education was difficult as all the cohorts used qualifications as the index and these varied in detail and across jurisdiction (UK and France). The decision to harmonise on the basis of educational experience rather than qualifications provided a basis for greater integration, although it may be argued that the harmonised scale is less informative. Similarly, for cohabitation and housing type, where simplified scales were applied to the more detailed raw data. For household income local currency was used and aggregated into four quantiles of annual income. For exercise (vigorous, moderate and walking) a simple quantification was not possible due to the diversity of measurement and harmonisation was limited to presence or absence.

Table 3.

Application of algorithmic transformations across cohorts

Variable	Transformation	Cohort
Variable	Transformation	Airwave	ELSA	Generation Scotland	Memento
Smoking	Never	Never	Ever smoked Yes/no	Never	Never
	Ex	Ex	–	Ex	Ex
	current	current	–	current	current
Smoking?	Yes
Smoking?	no
Ethnicity	White	White	White	White	–
	Black	Other	Non white	–	–
	Asian	-	–	Asian	–
	Mixed	-	–	Mixed	–
Education	Post grad equivalent (ISCED 7/8)*	Post grad	–	–	Higher dipl
	Degree equivalent (ISCED 4–6)	Deg. equiv	Deg. Equiv	College/Uni	Degree
	Secondary (ISCED 2/3)	–	Higher ed	–	General Bac
		A level NVQ3	NVQ3	Highers	Tech Bac
		GCSE NVQ2	NVQ2	Standards	CAP/BEP
		NVQ1	NVQ1	CSE equivalent	Elementary
		–	Foreign/other	–
	< = Primary (ISCED 1)	–	–	Certificate	Primary
		No qualification	No qualification	No qualification	No qualification
Cohabitation	Single	Single	Single	Are you living as a couple? yes/no	Single
	Married/cohabiting	Married	Married	–	Married/cohabiting
		–	Remarried	–	–
		cohabiting	–	–	–
	Divorced/separated	Divorced	Divorced	–	Divorced/separated
		Separated	Separated	–	–
	widowed	–	Widowed	–	Widowed
	Other	Other	–	–	–
Housing type	House/bungalow	–	–	House/bungalow	Single family dwelling
	Apartment	–	–	Apartment/flat	Apartment
	Sheltered/residential	–	–	Hostel	Residential
		–	–	Mobile/caravan	Sheltered
		–	–	Sheltered	Religious community
		–	–	Homeless	Care home
	Other	–	–	Other	Other
Household income	Four quantiles using local currency	Annual: < £25,999, 26,000–37,999, 38,000–59,999, 600,000 +	Gross monthly and annual in Pounds Sterling	–	Monthly: €400– < 800 800– < 1200 1200– < 1800 1800– < 2500 2500 < 4000 4000– < 6000 6000 +
Vigorous exercise	Yes/No	–	Do you attend sports clubs, gym, exercise classes?	–	Days per week Hours per day Minutes per day No vigorous exercise
Moderate exercise	Yes/No	–	–	–	Days per week Hours per day Minutes per day No moderate exercise
Walking	Yes/No	–	–	–	Days per week Hours per day Minutes per day No walking

Open in a new tab

^*ISCED International standard classification of education [28]

The six cognitive performance scores were standardised to the Z distribution. The distributions for immediate recall (skew = − 0.42), delayed recall (skew = − 0.42), digit symbol substitution (skew = − 0.11), verbal fluency (skew = 0.31), were sufficiently Gaussian for Z-scores to be meaningful. For choice reaction time (skew = 1.09), a log_e normalisation was used before transforming to z-scores. For this exercise, fluid intelligence is an interpretation of the ELSA numeracy score from ELSA (skew = − 0.54).

Discussion

For a set of 124 variables, selected for relevance to neurodegeneration, a harmonisation schema designed for data discovery, was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal a loss of granularity. Overall, this demonstrated the feasibility and utility of using relatively simple harmonisation procedures for the purpose of data discovery.

Although these findings indicate value for data discovery, harmonisation is not an exact science and we have not described a mature process. The selection of variables, relevant to neurodegeneration was a reasonable starting point as it framed a specific use-case. Undoubtedly, the selection of specific variables reflected the research interest of the DPUK scientific community and the ADDI Data Harmonisation Group. However, a broad range of variables of generic interest were represented. From this limited variable-set a strong case can be made for incrementally expanding the range of harmonised variables. For a small number of variables (n = 9) using less granular harmonisation rules would have increased inclusiveness. Offering a selection of rules for investigators to choose which best suits their purpose would be straightforward.

Our model was developed using only four cohorts and not all cohorts had data on all variables. Given the use of relatively generic harmonisation rules, the addition of further cohorts with different patterns of ‘missingness’ is unlikely to materially affect the schema as described, but would inform its extension to other variables. For many variables the ‘Yes/No’ indicator was sufficiently generic that whether these variables should be construed as directly mapped or algorithmically transformed is moot. For the processing of free text, the manual interpretation of free text data used here is not scalable or necessarily consistent; the potential of natural language processing for rapid and consistent textual interpretation should be explored. The availability of biosamples was included in the variable list. Technically these are metadata, but were considered informative for data discovery. For cognitive performance, although the harmonisation process was straightforward, without claiming aetiological commonality, grouping tests according to widely used cognitive domains was judged a pragmatic solution. Harmonisation was not applied to longitudinal data. This was intentional to simplify the problem. However, the inclusive and generic nature of the harmonisation schema suggest that applying it longitudinally would be relatively straightforward.

The value of efficient data discovery is commensurate with growth in 3rd party data access and data complexity. Whilst national projects such as the ‘All of Us’ [23] and ‘UK Biobank’ [24] studies are specifically designed for 3rd party access, data discovery in most cohort studies remains challenging. These studies are not resourced to standardise their data, and consensus around how this may be achieved has not been reached. Data platforms, which provide global 3rd party access across multiple datasets provide an opportunity to develop these solutions, as they are positioned to develop harmonisation pipelines that can be applied systematically and consistently across datasets; enabling discovery at-scale and pace.

Efficient data discovery does not just require harmonisation. It also requires tools that exploit the potential that harmonisation brings. Existing cohort-based data discovery tools range from access to rudimentary spreadsheets, through online data dictionaries [25], to more sophisticated ‘shopping basket’ approaches offering discovery and selection [26]. The wide range of functionality and complexity of these tools is a strong argument for the development of tools that simplify discovery across datasets, and that follow-though into data selection. Using data platforms to consolidate approaches to data discovery and variable selection across multiple datasets incentivises the development of more ergonomic and powerful tooling.

The case for streamlining and standardising data discovery can be difficult to make. However, an example of where this has been transformative is the introduction of reference SNP cluster ID (rs) numbers [27]. By establishing rules around how to annotate genetic data, confidence in the provenance of data is increased, transaction costs of data discovery and access lowered, and rigour improved. The point being that simple solutions can be used to increase scientific opportunity rather than restrict academic freedom. This paper demonstrates that a similar exercise for cohort data is technically feasible and argues that it would be highly valuable. The cohorts who participated in this project have all deposited their data with DPUK and are available upon application through DPUK. The harmonised dataset used for this project will be available as an optional data format in the future, integrated within the DPUK curation programme [11].

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 62 KB)^{(61.5KB, docx)}

Supplementary file2 (DOCX 14 KB)^{(13.9KB, docx)}

Acknowledgements

We would like to acknowledge the pioneering work of Professor Isobel Fortier of McGill University in developing the field of data harmonisation for cohort data. Airwave: The Airwave Health Monitoring Study is funded by the Home Office (grant number 780-TETRA) with additional support from the National Institute for Health Research (NIHR) Biomedical Research Centre. The Airwave Study uses the computing resources of the UK MEDical BIOinformatics Partnership (UK MED-BIO: supported by the Medical Research Council MR/L01632X/1). ELSA: The English Longitudinal Study of Ageing was developed by a team of researchers based at University College London, NatCen Social Research, the Institute for Fiscal Studies, the University of Manchester and the University of East Anglia. The data were collected by NatCen Social Research. The funding is currently provided by the National Institute on Aging in the US, and a consortium of UK government departments coordinated by the National Institute for Health Research. Funding has also been received by the Economic and Social Research Council. Generation Scotland: Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates [CZD/16/6] and the Scottish Funding Council [HR03006]. Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the GSMAPP Access Policy v6-7 December 2016 FINAL 8 Wellcome Trust Clinical Research Facility, Edinburgh, Scotland and was funded by the Medical Research Council UK and the Wellcome Trust (Wellcome Trust Strategic Award “STratifying Resilience and Depression Longitudinally” (STRADL) Reference 104036/Z/14/Z.Memento: The Memento cohort was supported by a grant from the Fondation Plan Alzheimer (Alzheimer Plan 2008-15 2012) and sponsored by the Bordeaux University Hospital. This work was also conducted by the following: CIC 1401-EC, Bordeaux University Hospital, Inserm, and Bordeaux University Dementias Platform UK: The Medical Research Council supports DPUK through grant MR/TO333771. The PI is John Gallacher Mike Nalls: This research was supported in part by the Intramural Research Program of the NIH, National Institute on Aging (NIA), National Institutes of Health, Department of Health and Human Services; project numberZO1 AG000535, as well as the National Institute of Neurological Disorders and Stroke. Mike Nall’s participation in this project was part of a competitive contract awarded to Data Tecnica International LLC by the National Institutes of Health to support open science research, he also currently serves on the scientific advisory board for Clover Therapeutics and is an advisor to Neuron23 Inc.

Author contributions

All authors contributed to the conception, creation and development of the manuscript. Material preparation was conducted by John Gallacher and Sarah Bauermeister. The first draft of the manuscript was written by John Gallacher and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by the UK Research and Innovation Medical Research Council [MR/T0333771].

Data availability

Not applicable.

Code availability

Not applicable.

Declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Ethics approval

Not applicable.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Bauermeister S, Orton C, Thompson S, Barker RA, Bauermeister JR, Ben-Shlomo Y, et al. the dementias platform UK (DPUK) data portal. Eur J Epidemiol. 2020;35(6):601–611. doi: 10.1007/s10654-020-00633-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.The Global Alzheimer’s Association Interactive network (GAAIN) [11/04/2022]. Available from: https://gaain.org/
3.Dementias Platform Australia (DPAU) [11/04/2022]. Available from: https://www.dementiasplatform.com.au/
4.Alzheimer's Disease Workbench [08/02/2022]. Available from: https://www.alzheimersdata.org/ad-workbench
5.European Medical Framework for Alzheimer’s Diesease (EMIF-AD) [08/02/2022]. Available from: http://www.emif.eu/
6.Maelstrom Research [25/07/2022]. Available from: https://www.maelstrom-research.org/
7.Integrative Analysis of Longitudinal Studies of Aging (IALSA) [08/02/2022]. Available from https://www.ialsa.org/
8.Closer: The home of longitudinal research [25/07/2022]. Available from: https://www.closer.ac.uk/
9.Adhikari K, Patten SB, Patel AB, Premji S, Tough S, Letourneau N. Data harmonixation and data pooling from cohort studies: a practical approach for data management. Int J Popul Data Sci. 2021;30(1):1680. doi: 10.23889/ijpds.v6i1.1680. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Almeida JR, Silva LB, Bos I, Visser PJ, Oliviera JL. A methodology for cohort harmonisation in multcentre clinical research. Inf Med Unlocked. 2021;27:1007–1060. [Google Scholar]
11.Bergeron J, Massiccotte R, Atkinson S, Bocking A, Fraser W, Fortier I. Cohort profile: research advancement through cohort cataloguing and harmonization (ReACH) Int J Epidemiol. 2020;50(2):396–397i. doi: 10.1093/ije/dyaa207. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.de Moira AP, Haakma S, Strandberg-Larsen K, van Enckevort E, Kooijman M, Cadman T, et al. The EU child cohort network’s core data: establishing a set of findable, accessible, interoperable and re-usable (FAIR) variables. Int J Epidemiol. 2021;36:565–580. doi: 10.1007/s10654-021-00733-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.O’Connor M, Moreno-Betancur M, Goldfield S, Wake M, et al. Data resource profile: melbourne children’s lifecourse initiative (lifecourse) Int J Epidemiol. 2022;51(5):e229–e244. doi: 10.1093/ije/dyac086. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Bauermeister S, Bauermeister J, Bridgman R, Felici C, Newbury N, North L et al.(2023) Research-ready data: the C-Surv data model. (in press) Eur J Epidemiol [DOI] [PMC free article] [PubMed]
15.The OMOP common data model. [25/07/2022]. Available from: https://www.ohdsi.org/data-standardization/the-common-data-model/
16.CDISC: Clinical Data Interhcange Standards Consortium. [25/07/2022]. Available from: https://www.cdisc.org/
17.Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2017;46(1):103–105. doi: 10.1093/ije/dyw075. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Elliott P, Vergnaud AC, Singh D, Neasham D, Spear J, Heard A. The airwave health monitoring study of police officers and staff in Great Britain: rationale, design and methods. Environ Res. 2014;134:280–285. doi: 10.1016/j.envres.2014.07.025. [DOI] [PubMed] [Google Scholar]
19.Steptoe A, Breeze E, Banks J, Nazroo J. Cohort profile: the English longitudinal study of ageing. Int J Epidemiol. 2013;42(6):1640–1648. doi: 10.1093/ije/dys168. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Smith BH, Campbell A, Linksted P, Fitzpatrick B, Jackson C, Kerr SM, et al. Cohort profile: generation Scotland: Scottish family health study (GS:SFHS). The study, its participants and their potential for genetic research on health and illness. Int J Epidemiol. 2013;42(3):689. doi: 10.1093/ije/dys084. [DOI] [PubMed] [Google Scholar]
21.Dufouil C, Dubois B, Vellas B, Pasquier F, Blanc F, Hugon J, et al. Cognitive and imaging markers in non-demented subjects attending a memory clinic: study design and baseline findings of the MEMENTO cohort. Alzheimers Res Ther. 2017;9(1):67. doi: 10.1186/s13195-017-0288-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.The DPUK Data Portal. [25/07/2022]. Available from: https://portal.dementiasplatform.uk/
23.All of Us Research Program. [25/07/2022]. Available from: https://allofus.nih.gov/
24.UK Biobank. [25/07/2022]. Available from: https://www.ukbiobank.ac.uk/
25.EPIC Norfolk. [25/07/2022]. Available from: https://www.epic-norfolk.org.uk/
26.MRC National Study of Health and Development (MRC NSHD). [25/07/2022]. Available from: http://skylark.ucl.ac.uk
27.Sherry, et al. dbSNP: database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 1999;1999(9):677–679. doi: 10.1101/gr.9.8.677. [DOI] [PubMed] [Google Scholar]
28.International Standard Classification of Education (ISCED) [03/01/2023}. Available from: https://ilostat.ilo.org/resources/concepts-and-definitions/classification-education/

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary file1 (DOCX 62 KB)^{(61.5KB, docx)}

Supplementary file2 (DOCX 14 KB)^{(13.9KB, docx)}

Data Availability Statement

Not applicable.

[CR1] 1.Bauermeister S, Orton C, Thompson S, Barker RA, Bauermeister JR, Ben-Shlomo Y, et al. the dementias platform UK (DPUK) data portal. Eur J Epidemiol. 2020;35(6):601–611. doi: 10.1007/s10654-020-00633-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.The Global Alzheimer’s Association Interactive network (GAAIN) [11/04/2022]. Available from: https://gaain.org/

[CR3] 3.Dementias Platform Australia (DPAU) [11/04/2022]. Available from: https://www.dementiasplatform.com.au/

[CR4] 4.Alzheimer's Disease Workbench [08/02/2022]. Available from: https://www.alzheimersdata.org/ad-workbench

[CR5] 5.European Medical Framework for Alzheimer’s Diesease (EMIF-AD) [08/02/2022]. Available from: http://www.emif.eu/

[CR6] 6.Maelstrom Research [25/07/2022]. Available from: https://www.maelstrom-research.org/

[CR7] 7.Integrative Analysis of Longitudinal Studies of Aging (IALSA) [08/02/2022]. Available from https://www.ialsa.org/

[CR8] 8.Closer: The home of longitudinal research [25/07/2022]. Available from: https://www.closer.ac.uk/

[CR9] 9.Adhikari K, Patten SB, Patel AB, Premji S, Tough S, Letourneau N. Data harmonixation and data pooling from cohort studies: a practical approach for data management. Int J Popul Data Sci. 2021;30(1):1680. doi: 10.23889/ijpds.v6i1.1680. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Almeida JR, Silva LB, Bos I, Visser PJ, Oliviera JL. A methodology for cohort harmonisation in multcentre clinical research. Inf Med Unlocked. 2021;27:1007–1060. [Google Scholar]

[CR11] 11.Bergeron J, Massiccotte R, Atkinson S, Bocking A, Fraser W, Fortier I. Cohort profile: research advancement through cohort cataloguing and harmonization (ReACH) Int J Epidemiol. 2020;50(2):396–397i. doi: 10.1093/ije/dyaa207. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.de Moira AP, Haakma S, Strandberg-Larsen K, van Enckevort E, Kooijman M, Cadman T, et al. The EU child cohort network’s core data: establishing a set of findable, accessible, interoperable and re-usable (FAIR) variables. Int J Epidemiol. 2021;36:565–580. doi: 10.1007/s10654-021-00733-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.O’Connor M, Moreno-Betancur M, Goldfield S, Wake M, et al. Data resource profile: melbourne children’s lifecourse initiative (lifecourse) Int J Epidemiol. 2022;51(5):e229–e244. doi: 10.1093/ije/dyac086. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Bauermeister S, Bauermeister J, Bridgman R, Felici C, Newbury N, North L et al.(2023) Research-ready data: the C-Surv data model. (in press) Eur J Epidemiol [DOI] [PMC free article] [PubMed]

[CR15] 15.The OMOP common data model. [25/07/2022]. Available from: https://www.ohdsi.org/data-standardization/the-common-data-model/

[CR16] 16.CDISC: Clinical Data Interhcange Standards Consortium. [25/07/2022]. Available from: https://www.cdisc.org/

[CR17] 17.Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2017;46(1):103–105. doi: 10.1093/ije/dyw075. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Elliott P, Vergnaud AC, Singh D, Neasham D, Spear J, Heard A. The airwave health monitoring study of police officers and staff in Great Britain: rationale, design and methods. Environ Res. 2014;134:280–285. doi: 10.1016/j.envres.2014.07.025. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Steptoe A, Breeze E, Banks J, Nazroo J. Cohort profile: the English longitudinal study of ageing. Int J Epidemiol. 2013;42(6):1640–1648. doi: 10.1093/ije/dys168. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Smith BH, Campbell A, Linksted P, Fitzpatrick B, Jackson C, Kerr SM, et al. Cohort profile: generation Scotland: Scottish family health study (GS:SFHS). The study, its participants and their potential for genetic research on health and illness. Int J Epidemiol. 2013;42(3):689. doi: 10.1093/ije/dys084. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Dufouil C, Dubois B, Vellas B, Pasquier F, Blanc F, Hugon J, et al. Cognitive and imaging markers in non-demented subjects attending a memory clinic: study design and baseline findings of the MEMENTO cohort. Alzheimers Res Ther. 2017;9(1):67. doi: 10.1186/s13195-017-0288-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.The DPUK Data Portal. [25/07/2022]. Available from: https://portal.dementiasplatform.uk/

[CR23] 23.All of Us Research Program. [25/07/2022]. Available from: https://allofus.nih.gov/

[CR24] 24.UK Biobank. [25/07/2022]. Available from: https://www.ukbiobank.ac.uk/

[CR25] 25.EPIC Norfolk. [25/07/2022]. Available from: https://www.epic-norfolk.org.uk/

[CR26] 26.MRC National Study of Health and Development (MRC NSHD). [25/07/2022]. Available from: http://skylark.ucl.ac.uk

[CR27] 27.Sherry, et al. dbSNP: database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 1999;1999(9):677–679. doi: 10.1101/gr.9.8.677. [DOI] [PubMed] [Google Scholar]

[CR28] 28.International Standard Classification of Education (ISCED) [03/01/2023}. Available from: https://ilostat.ilo.org/resources/concepts-and-definitions/classification-education/

PERMALINK

Evaluating the harmonisation potential of diverse cohort datasets

Sarah Bauermeister

Mukta Phatak

Kelly Sparks

Lana Sargent

Michael Griswold

Caitlin McHugh

Mike Nalls

Simon Young

Joshua Bauermeister

Paul Elliott

Andrew Steptoe

David Porteous

Carole Dufouil

John Gallacher

Abstract

Supplementary Information

Background

Methods

Variable selection

Standardisation

Schema development

Schema evaluation

Results

Core variables

Table 1.

Representation and distribution

Table 2.

Fig. 1.

Utility

Table 3.

Discussion

Supplementary Information

Acknowledgements

Author contributions

Funding

Data availability

Code availability

Declarations

Conflicts of interest

Consent to participate

Consent for publication

Ethics approval

Footnotes

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases