‘Gold standard’ or reference data |
• Data where the true match status is known, used to test linkage algorithms and estimate rates of linkage error. |
• Typically based on a subsample of records that have been manually reviewed, an additional data source with complete identifiers, a representative synthetic dataset, or external reference rates for the population of interest. |
^ For example, comparison of mortality rates based on linkage of death registrations versus national figures (Schmidlin et al., 2013) or comparison of infection rates within a subset of validated data (Harron et al., 2013; Paixao et al., in press). |
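As an illustration of how reference data might be used, the sketch below compares a set of links produced by an algorithm against a gold-standard set of true matches. The function name, the record identifiers and the toy data are hypothetical. The false-match rate is taken here as the proportion of links that are not true matches, and the missed-match rate as the proportion of true matches that were not linked; other definitions are in use, so these should be read as assumptions rather than a standard.

```python
def linkage_error_rates(linked_pairs, true_pairs):
    """Estimate linkage error rates against gold-standard true matches.

    Both arguments are iterables of (id_in_file_A, id_in_file_B) tuples.
    """
    linked = set(linked_pairs)
    truth = set(true_pairs)

    false_matches = linked - truth    # linked, but not a true match
    missed_matches = truth - linked   # true match, but not linked

    return {
        # Assumed definition: share of all links that are false.
        "false_match_rate": len(false_matches) / len(linked) if linked else 0.0,
        # Assumed definition: share of all true matches that were missed.
        "missed_match_rate": len(missed_matches) / len(truth) if truth else 0.0,
    }


# Hypothetical gold standard (e.g. from manual review) and algorithm output.
gold = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
links = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

rates = linkage_error_rates(links, gold)
print(rates)
```

In this toy example one of the three links is false and two of the four true matches are missed.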
Post-linkage data validation |
• Used to estimate minimum false-match rates by identifying implausible scenarios within the data. |
^ For example, linkage of a hospital admission record dated after a known date of death could indicate a false match, as could linkage of multiple death records to a single census record (Blakely and Salmond, 2002; Hagger-Johnson et al., 2014). |
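A check of this kind could be sketched as follows, assuming each linked record carries an admission date and, where known, a date of death. The field names (`link_id`, `admission_date`, `date_of_death`) and the data are hypothetical. Because only implausible links are detected, the resulting proportion is a lower bound on the false-match rate, not an estimate of it.

```python
from datetime import date


def implausible_links(linked_records):
    """Return the link_ids of records admitted after their date of death."""
    flagged = []
    for rec in linked_records:
        dod = rec.get("date_of_death")
        # Records with no known date of death cannot be checked this way.
        if dod is not None and rec["admission_date"] > dod:
            flagged.append(rec["link_id"])
    return flagged


# Hypothetical linked hospital admission / death registration records.
records = [
    {"link_id": 1, "admission_date": date(2012, 3, 1), "date_of_death": None},
    {"link_id": 2, "admission_date": date(2012, 5, 10), "date_of_death": date(2012, 4, 2)},
    {"link_id": 3, "admission_date": date(2011, 1, 15), "date_of_death": date(2013, 6, 30)},
    {"link_id": 4, "admission_date": date(2014, 2, 1), "date_of_death": date(2013, 12, 25)},
]

flagged = implausible_links(records)
min_false_match_rate = len(flagged) / len(records)
print(flagged, min_false_match_rate)
```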
Sensitivity analyses |
• Used to assess the extent to which results vary according to different linkage criteria. |
• Could involve changing the linkage algorithm or changing the threshold within probabilistic linkage, and re-running analyses to evaluate any impact on results (Lariscy, 2011). |
^ For example, comparing results over a range of match weights could help identify the direction of the effect of linkage errors on outcomes of interest (Moore et al., 2014). |
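One way to sketch a threshold-based sensitivity analysis is to recompute an outcome rate at several match-weight cut-offs, accepting only candidate pairs at or above each cut-off. The function, the field names (`weight`, `outcome`) and the weights themselves are hypothetical; in practice the full analysis of interest would be re-run at each threshold rather than a single proportion.

```python
def outcome_rate_by_threshold(candidate_pairs, thresholds):
    """Recompute the outcome rate among accepted links at each threshold."""
    results = {}
    for t in thresholds:
        # Accept a candidate pair as a link if its match weight reaches t.
        accepted = [p for p in candidate_pairs if p["weight"] >= t]
        if accepted:
            results[t] = sum(p["outcome"] for p in accepted) / len(accepted)
        else:
            results[t] = None  # no links accepted at this threshold
    return results


# Hypothetical candidate pairs with match weights and a binary outcome.
pairs = [
    {"weight": 25.0, "outcome": 1},
    {"weight": 22.0, "outcome": 0},
    {"weight": 18.0, "outcome": 0},
    {"weight": 12.0, "outcome": 1},
    {"weight": 9.0, "outcome": 1},
    {"weight": 5.0, "outcome": 1},
]

rates = outcome_rate_by_threshold(pairs, [20, 15, 10, 5])
for threshold, rate in rates.items():
    print(threshold, rate)
```

If the outcome rate drifts steadily as the threshold is relaxed, that pattern may suggest the direction in which linkage error is pushing the estimate.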
Comparing characteristics of linked and unlinked data |
• Used to identify any differences in linkage rates for different subgroups of individuals. |
^ For example, comparing rates of preterm birth in linked and unlinked maternity records (Ford et al., 2006; Harron et al., 2016). |
• Where not all records are expected to match, distributions of variables in the linked data can be compared to external sources (e.g. age and/or ethnic group distributions from national census data) to explore any evidence of selection bias (Harron et al., 2016). |
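A comparison of linked and unlinked records can be sketched as below, using a binary characteristic such as preterm birth. The function, the field names (`linked`, `preterm`) and the records are hypothetical, and a real comparison would normally cover several characteristics and account for sampling variability.

```python
def rate_by_link_status(records, characteristic):
    """Proportion of records with a binary characteristic, by link status."""
    totals = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for rec in records:
        status = bool(rec["linked"])
        totals[status] += 1
        counts[status] += int(bool(rec[characteristic]))
    return {
        "linked": counts[True] / totals[True] if totals[True] else None,
        "unlinked": counts[False] / totals[False] if totals[False] else None,
    }


# Hypothetical maternity records: 5 linked (1 preterm), 4 unlinked (2 preterm).
maternity = (
    [{"linked": True, "preterm": False}] * 4
    + [{"linked": True, "preterm": True}]
    + [{"linked": False, "preterm": False}] * 2
    + [{"linked": False, "preterm": True}] * 2
)

preterm_rates = rate_by_link_status(maternity, "preterm")
print(preterm_rates)
```

A markedly higher rate among unlinked records, as in this toy data, would indicate that the linked subset under-represents the group of interest.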