2020 May 20;47(2):218–226. doi: 10.1080/03014460.2020.1742379

Table 3.

Methods for estimating rates and distributions of linkage error.

Method	Description	Example	Requirements
Manual review	Manual inspection of record pairs is used to make a decision about whether two records belong to the same individual or not, based on similarities of identifiers held in those records. Humans may recognise small differences between identifiers that may not have been fully captured in an automated linkage strategy (e.g. recognising that Beth is a derivative of Elizabeth, or that December 31 1999 is close to January 01 2000).	Manual review is routinely used at the Centre for Health Record Linkage (CHeReL; New South Wales Ministry of Health). (Centre for Health Record Linkage 2012). CHeReL holds linked records from a number of administrative datasets, including records of hospitalisations, emergency department presentations, births, cancer registrations and deaths. Manual review of a subsample of the 114 Million Brazil cohort has been used to generate training data to inform machine learning approaches to assessing linkage quality (Pita et al. 2017).	Access to identifiers
Applying a linkage algorithm to a subset of (gold-standard) data	Testing a linkage strategy on a sample of data where the true match status is known can provide estimates of linkage error rates. “Gold-standard” or “training” datasets might come from a subset of data where a unique identifier is available, where manual review can be performed on a sample of data, or where external information is available. If a subsample is used, it should be representative of the quality of the main dataset.	Linkage of admission records for children in intensive care with laboratory records from infection surveillance systems. In this study, 2 of the 22 laboratories were able to provide high quality, complete and unique identifiers that were used to create a gold-standard subsample. (Harron et al. 2013).	Access to identifiers within a gold-standard or training dataset
Applying a linkage algorithm to “negative controls”	Testing a linkage strategy on a subset of data we are sure should not link (i.e. data from two unrelated populations) can be a convenient way of identifying false match rates.	Linking birth records to hospital records for pregnant women known to have had an abortive outcome (i.e. where no birth record should exist). (Paixão et al. 2019).	Access to identifiers
Identification of implausible scenarios	False matches can be identified in cases where (non-identifiable) information in the records means it is unlikely that two records belong to the same individual (e.g. a male patient being admitted for a caesarean section, or an admission following a death). In cases where we expect there to be a maximum of one match per record (e.g. 1:1 or many:1 matching), multiple matches per record will indicate one or more false matches. Identifying false matches in these ways can provide a minimal estimate of the false match rate.^*	Identifying false matches through implausible sequences of events in hospital data, e.g. multiple admissions on the same day in different parts of the country. (Hagger-Johnson et al. 2015). Estimating the number of false matches by counting the number of duplicate matches between Census and mortality records. (Blakely and Salmond 2002).	Access to attribute data and knowledge about potential implausible scenarios
Comparison of linked and unlinked records	In a “Master” or “Nested” structure where we expect all cohort records to link, the number of missed matches can be estimated as the number of records that failed to link. In other linkage structures, sometimes a known subset of cohort members will be expected to link (“positive controls”), for example when information about disease or vital status can be obtained from other data sources as well as linkage to patient/death registers. In these cases, linked and unlinked members of the subset can be compared.^**	Comparing the characteristics of linked and unlinked maternal and baby hospital records in New South Wales. (Ford et al. 2006). Linkage of a subset of prisoners known to have died in prison: after linkage to a register of deaths, the match rate among this subgroup was assessed and the characteristics of linked and unlinked records in this subset could have been compared. (Moore et al. 2014).	Access to unlinked records with attribute data
Comparison of records with high versus low quality identifiers	Records with missing or invalid identifiers may be less likely or even impossible to link in many applications, so comparing the distribution of identifier quality with respect to variables of interest can provide information about the minimum number of missed links (those with insufficient data for linkage) and the likely distribution of missed links with respect to variables of interest.	Comparison of records with and without a valid NHS number for linkage of tuberculosis case notifications and a laboratory database of all culture positive isolates from tuberculosis reference laboratories. (Aldridge et al. 2015).	Access to record-level or aggregate indicators of identifier quality and attribute data
Comparisons with external data sources	In situations where the expected number of links is not known a priori, comparing the characteristics of a linked sample of records with other representative data can help identify whether the linked records are broadly representative, or whether linkage errors might have contributed to observed differences.	Comparing the characteristics of a cohort of linked mother-baby records with national published statistics on birth characteristics. (Harron et al. 2016).	Access to attribute data only

^*

More false matches might be present, but unidentified.

^**

Technically, if there are false matches, there will also be additional missed matches (i.e. if a link is made to the wrong record, it will not appear as a missed match, but we will have missed the correct match).
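
Two of the methods in the table lend themselves to a short worked sketch: testing a linkage strategy on a gold-standard subsample, and counting duplicate links where at most one match per record is expected. The Python below is only an illustration under stated assumptions. The function names, the record IDs, and the representation of links as (record in file A, record in file B) pairs are all hypothetical and do not come from any of the cited studies. The false match rate is taken here as the share of links made that are wrong, and the missed match rate as the share of true matches that were not linked; other definitions are in use.

```python
from collections import Counter


def linkage_error_rates(predicted_links, true_links):
    """Estimate error rates on a gold-standard subsample.

    Both arguments are iterables of (id_in_file_a, id_in_file_b) pairs.
    Assumed definitions (others exist):
      false match rate  = wrong links / all links made
      missed match rate = true matches not linked / all true matches
    """
    predicted = set(predicted_links)
    truth = set(true_links)
    false_matches = predicted - truth    # linked, but not a true match
    missed_matches = truth - predicted   # true match, but not linked
    return {
        "false_matches": len(false_matches),
        "missed_matches": len(missed_matches),
        "false_match_rate": len(false_matches) / len(predicted) if predicted else 0.0,
        "missed_match_rate": len(missed_matches) / len(truth) if truth else 0.0,
    }


def min_false_matches_from_duplicates(predicted_links):
    """Lower bound on false matches when each file A record should link
    to at most one file B record (1:1 or many:1 linkage).

    A record with n links has at least n - 1 false ones.
    """
    links_per_record = Counter(a for a, _ in set(predicted_links))
    return sum(n - 1 for n in links_per_record.values() if n > 1)


# Hypothetical toy data: a3 is linked to the wrong record, a4 is linked
# twice, and a5 is not linked at all.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b4"), ("a4", "b5")}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")}

rates = linkage_error_rates(predicted, truth)
duplicate_lower_bound = min_false_matches_from_duplicates(predicted)
print(rates)
print(duplicate_lower_bound)
```

In the toy data, the false link from a3 to b9 also produces a missed match (a3 to b3), which is the situation the second footnote describes. The duplicate count finds only one of the two false matches, consistent with the first footnote's point that such checks give a minimum estimate.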