AMIA Annual Symposium Proceedings. 2006;2006:779–783.

Record Linkage: Making the Most Out of Errors in Linking Variables

M Tromp 1,3, JB Reitsma 2, ACJ Ravelli 1, N Méray 1, GJ Bonsel 3
PMCID: PMC1839331  PMID: 17238447

Abstract

This paper presents a refinement of the probabilistic medical record linking algorithm. We introduced “close agreement” to account for typical errors in administrative variables used for record linkage. Linking data on early pregnancy determinants with data on late child outcomes was used as a case study. We analyzed whether the addition of close agreement resulted in a higher discriminating power of the linking key reflected in a reduction of the number of links with an uncertain linking status. Incorporating close agreement for postal code and date of birth in the record linking algorithm resulted in a reduction of 95% of the number of pairs in the uncertain region. We showed that the extension of a third outcome “close” when comparing values of corresponding linking variables led to a major improvement in our probabilistic record linkage study. Similar improvements are likely in other studies because the frequency, nature, and type of errors in other large databases will not be substantially different.

Introduction

Medical record linkage techniques are commonly applied when data from different sources, if combined, provide an answer to a clinical or public health question, for instance when electronic health records of patients are stored in multiple databases and registries. In the absence of a unique identifier in these data sources, medical record linkage techniques are used to identify records belonging to the same individual.1 Partly identifying variables present in both datasets are combined to create a strongly discriminating, yet anonymous, linking key.

Probabilistic medical record linkage strategies use the information value of the linking variables by assigning a weight for agreement and a weight for disagreement for every variable separately. These weights are based on two probabilities: the probability that a variable agrees among matches (mi) and the probability that a variable agrees among non-matches (ui). The mi value reflects the reliability of the variable (error rate), while the ui value reflects the discriminating power of the variable (chance agreement).1;2 These are the two fundamental concepts in record linkage: error rate and discriminating power. Together they influence the number of incorrect decisions in a linking procedure; a low discriminating power increases the number of false links and a high error rate increases the number of false non-links. The discriminating power of the linking key is determined by the probability that two different persons have the same values on the linking key: the lower this probability, the higher the discriminating power. The (theoretical) number of possible values of the linking variables and their (empirical) distribution determine this probability. A variable with a large number of possible values and a uniform distribution has a high discriminating power, reflected by a low ui probability. Errors in variables will lead to a difference in the value of a corresponding variable among matches, thereby lowering the mi probability.
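The link between the distribution of a variable and its chance agreement can be made concrete with a small sketch. It assumes that the chance agreement of a variable is approximated by the probability that two randomly drawn records, one from each file, carry the same value; the function name `chance_agreement` and the example data are illustrative and do not come from the study.

```python
from collections import Counter

def chance_agreement(values_a, values_b):
    """Probability that a random record from file A and a random record
    from file B have the same value (an approximation of u)."""
    freq_a = Counter(values_a)
    freq_b = Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    return sum((count / n_a) * (freq_b[value] / n_b)
               for value, count in freq_a.items())

# A dichotomous, evenly distributed variable: low discriminating power.
gender_a = ["M", "F"] * 500
gender_b = ["F", "M"] * 500
u_gender = chance_agreement(gender_a, gender_b)

# A variable with 1,000 equally likely values: high discriminating power.
codes = [f"{i:04d}" for i in range(1000)]
u_code = chance_agreement(codes, codes)

print(u_gender, u_code)
```

In this toy example the dichotomous variable agrees by chance in about half of all pairs, whereas the variable with many uniformly distributed values agrees by chance in about one pair per thousand.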

Differences in values between corresponding variables can have many underlying causes: errors created during collection of the data, errors during data entry, or true changes of a value. The likelihood of errors depends on the type of variable, the type of data collection and the way of data entry. If the occurrence or the origin of errors has a certain pattern and is therefore predictable, this knowledge can be incorporated in the record linking algorithm to improve the outcome of the linkage. For example: a date variable is more prone to error than a dichotomous variable (predictability based on type of variable). Data entry errors might occur more often at the end of the day than at the beginning of the day (predictability in occurrence). In case of error, the new value is often dependent on the original value (predictability in origin). This last type of predictability of errors in linking variables is the topic of this paper. This dependency can exist in many different forms: a literal distance (birth weight weighed on two scales), but also a figurative distance (different spelling of the same name, keyboard errors, transposition of day and month in a date). Soundex algorithms have been used to account for differences in spelling of names.3 Partial agreement on linking variables has been used in the manual review process of a linkage procedure or as a further refinement to resolve ties among links.4;5

A typical example of a relevant clinical question requiring combined data is the relation between early determinants and late outcomes. Profession-based medical registries do not always cover this wide horizon, because they are limited to the care provided. We use the linkage of data on pregnancy determinants and child outcome (neonatal and infant mortality) as a case study to account for errors in linking variables.

In the Netherlands, data on perinatal care are registered in four profession-based anonymous national registries: the LVR1-registry (midwives), the LVRh-registry (GPs), the LVR2-registry (obstetricians) and the LNR-registry (pediatricians and neonatologists). Low level care is provided by midwives and general practitioners, and high level care by obstetricians. The level of care might change (in either direction) during pregnancy, delivery and/or the postpartum period. Admissions of newborns are recorded until 28 days after term date in the registry of pediatricians and neonatologists. The perinatal registries have been combined using probabilistic medical record linkage strategies to form one perinatal registry with records containing combined information on mother, pregnancy, birth, the postpartum period and the child.6

The Dutch Population Register is a national assembly of all municipal registers and contains information on address and family relations of Dutch residents. A liveborn is registered along with its parents. The birth and mortality statistics of Statistics Netherlands are based on this Population Register. The causes of death register is legally held by Statistics Netherlands. In addition, a separate national register exists of stillborns of >=24 weeks. Statistics Netherlands combines this register with information from the causes of death register for the statistics on stillborns. Because of privacy laws in the Netherlands the population and stillborn registers are anonymous on a national level.

In a pilot study (2005–2006), we tried to combine the perinatal registry data and the population register data (including mortality statistics) in order to produce valid figures on the long term outcomes neonatal mortality (1st month) and infant mortality (1st year) by early determinants (gestational age and birth weight) for the Netherlands. To overcome the limited discriminating power of our linking key, we introduced “close agreement” to detect and account for typical errors in administrative variables. “Close agreement” is the situation where two values are not in perfect agreement, but are close to each other. The definition of “close” depends on the nature of the involved variable (for instance comparing dates of birth versus comparing birth weights) and on knowledge about the characteristics of the entry procedure, e.g. the type of error checks that are in place. We then investigated whether the discriminating power of the linking key increased by adding close agreement, reflected by a reduction of the number of links with an uncertain linking status.

Methods

We used the data files of the year 2001 for this study as this was the first available linked perinatal registry file (indicated by date of birth child). Names were not included. The perinatal registry contained 202,904 records on 188,628 liveborns and 1,545 stillborns for 2001. The remaining records did not include pregnancy outcome. The data files of the national population register and the stillborn register that we used contained records of 205,525 liveborns and 1,261 stillborns for 2001.

We started by building a basic probabilistic record linkage algorithm that only includes two options with respect to comparing corresponding linking variables: variables either agree or disagree. Because linking variables differed considerably with respect to their discriminatory value and their likelihood of containing error, we calculated different mi and ui values for each variable. Ui probabilities were calculated from the marginal distribution in the two files as true non-matches make up the largest part of the total number of pairs. Because the matching status is unknown, mi values were estimated using the Expectation Maximization (EM) algorithm with the observed patterns of agreement and disagreement of the singleton files.7;8 We excluded records with missing values on one of the linking variables when estimating weights, because otherwise pairs with missing values in both records will be included in the weight estimation of agreement. If the outcomes of the comparisons are independent between variables, the total log likelihood can be written as:

$$\sum_{p} n(\gamma^{p}) \log\left\{ \pi \prod_{i=1}^{k} m_i^{\gamma_i^{p}} (1-m_i)^{1-\gamma_i^{p}} + (1-\pi) \prod_{i=1}^{k} u_i^{\gamma_i^{p}} (1-u_i)^{1-\gamma_i^{p}} \right\} \quad \text{(Equation 1)}$$

where $m_i$ is the probability of agreement of the ith variable among matches, $u_i$ is the probability of agreement among non-matches, $\pi$ is the proportion of true matches among all possible record combinations, $n(\gamma^{p})$ is the number of record pairs with pattern $\gamma^{p}$, and $\gamma_i^{p}$ is the outcome of the comparison of variable $i$ in pattern $p$, for $i = 1,\dots,k$ and $p = 1,\dots,2^{k}$.

Using the mi and ui probabilities calculated in equation 1, we calculate the weights for agreement and disagreement as:

Agreement weight of the ith variable = $\log_2 \frac{m_i}{u_i}$, Disagreement weight of the ith variable = $\log_2 \frac{1-m_i}{1-u_i}$.
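As a worked example, the two weights can be computed directly from a pair of m and u probabilities. The sketch below uses the values reported for postal code in Table 1; the function names are illustrative. Because the published probabilities are rounded, the computed weights may differ from the published weights in the last digit.

```python
from math import log2

def agreement_weight(m, u):
    """Weight added to the total when a variable agrees."""
    return log2(m / u)

def disagreement_weight(m, u):
    """Weight (usually negative) added when a variable disagrees."""
    return log2((1 - m) / (1 - u))

# m and u probabilities reported for postal code in Table 1.
m_postal, u_postal = 0.9592, 0.00055
print(round(agreement_weight(m_postal, u_postal), 2))
print(round(disagreement_weight(m_postal, u_postal), 2))
```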

To formally incorporate our prior knowledge on errors we defined a so called “close” agreement. The underlying assumption is that close agreement is more likely to occur among matches if an error is being made than can be expected by chance alone. Thus, close agreement is a third possible outcome of variable comparison.

Incorporating this third outcome option into equation 1, leads to the following definition of the total log likelihood:

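$$\sum_{p} n(\gamma^{p}) \log\left\{ \pi \prod_{i=1}^{k} m_{fi}^{\gamma_{fi}^{p}}\, m_{ci}^{\gamma_{ci}^{p}}\, (1-m_{fi}-m_{ci})^{1-\gamma_{fi}^{p}-\gamma_{ci}^{p}} + (1-\pi) \prod_{i=1}^{k} u_{fi}^{\gamma_{fi}^{p}}\, u_{ci}^{\gamma_{ci}^{p}}\, (1-u_{fi}-u_{ci})^{1-\gamma_{fi}^{p}-\gamma_{ci}^{p}} \right\}$$

where $m_{fi}$ is the probability of full agreement of the ith variable among matches, $m_{ci}$ is the probability of close agreement among matches, $u_{fi}$ is the probability of full agreement among non-matches and $u_{ci}$ is the probability of close agreement among non-matches. The indicators $\gamma_{fi}^{p}$ and $\gamma_{ci}^{p}$ equal 1 if variable $i$ shows full or close agreement, respectively, in pattern $p$; at most one of them equals 1. The weight for close agreement is calculated in the same way as the weight for full agreement: $\log_2 \frac{m_{ci}}{u_{ci}}$

The weight for disagreement can now be calculated as: $\log_2 \frac{1-m_{fi}-m_{ci}}{1-u_{fi}-u_{ci}}$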
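The three weights can be illustrated with a short sketch. It assumes that the close agreement weight has the same form as the full agreement weight, i.e. the base-2 logarithm of the ratio of the close agreement probabilities among matches and non-matches, and it uses the probabilities reported in Table 2 for the first type of close agreement for postal code. The function name is illustrative.

```python
from math import log2

def three_outcome_weights(m_full, m_close, u_full, u_close):
    """Weights for full agreement, close agreement and disagreement."""
    return {
        "full": log2(m_full / u_full),
        "close": log2(m_close / u_close),
        # Disagreement is everything that is neither full nor close.
        "disagree": log2((1 - m_full - m_close) / (1 - u_full - u_close)),
    }

# Probabilities from Table 2, type (1): one of the four digits differs.
weights = three_outcome_weights(m_full=0.95920, m_close=0.01970,
                                u_full=0.00059, u_close=0.00695)
for outcome, weight in weights.items():
    print(outcome, round(weight, 2))
```

Because close agreement is more likely among matches than among non-matches in this example, its weight is positive, while the weight for the remaining disagreements becomes more negative than in the two-outcome model.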

For every record pair a total linkage weight is calculated by adding up all the individual variable weights. Linking weight was set to zero if a variable was missing in one or both records compared. All pairs were sorted by their total weight and a threshold value was determined separating links from non-links based on the estimated match rate by the EM algorithm and by reviewing pairs around this estimated threshold value. The small area around the threshold is called the ‘grey area’ because of the uncertainty of linking status in this area (figure 1). Pairs with a weight higher than the upper boundary of the grey area are certain links and under the lower boundary are certain non-links.
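The scoring and classification step can be sketched as below. The weights are those reported in Table 1 and the boundaries (4, 7 and 10) are those reported in the Results; the function names, the dictionary layout and the handling of pairs that fall exactly on a boundary are assumptions made for illustration.

```python
# Agreement and disagreement weights as reported in Table 1.
WEIGHTS = {
    "postal_code": {"agree": 10.76, "disagree": -4.61},
    "dob_child": {"agree": 8.48, "disagree": -6.54},
    "gender_child": {"agree": 0.99, "disagree": -6.93},
}

def total_weight(outcomes, weights=WEIGHTS):
    """Sum the variable weights; a missing comparison (None) contributes zero."""
    return sum(0.0 if outcome is None else weights[variable][outcome]
               for variable, outcome in outcomes.items())

def classify(weight, lower=4.0, threshold=7.0, upper=10.0):
    """Assign a linking status; boundary handling is an assumption."""
    if weight > upper:
        return "certain link"
    if weight < lower:
        return "certain non-link"
    return "uncertain link" if weight > threshold else "uncertain non-link"

pair = {"postal_code": "disagree", "dob_child": "agree", "gender_child": "agree"}
score = total_weight(pair)
print(round(score, 2), classify(score))
```

In this example a single disagreement on postal code is enough to move a pair that agrees on all other variables into the grey area.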

Figure 1. Number of pairs by their total linking weight for matches and non-matches with and without close agreement (fictional data).

If adding close agreement for linking variables increases the discriminating power of the linking key and thus better separates the matches from non-matches, the number of pairs within the grey area should decrease (depicted by the dotted line in figure 1).

The six available linking variables were: date of birth mother, postal code mother (4 digits), date of birth child, gender child, multiple birth status (number of children per pregnancy) and mortality. Multiple birth status was used to split both files, as we conducted the linkage separately for singletons and multiple births. Only singleton records were selected for this paper on the potential utility of close agreement. Death is rare and therefore we could not use this variable as a linking variable; instead we used it afterwards as a control. Because of the large file sizes, blocking was mandatory (blocking variable: date of birth mother). In a second step we used postal code as blocking variable to account for errors in date of birth mother (results not shown). We estimated and compared the mi and ui probabilities for the linking variables in the situation without and with close agreement. Within the close approach, we compared several options and the absolute and relative differences in mi and ui values between exact agreement, close agreement, and disagreement. Of all these differences, we examined which outcome had the biggest impact on the size of the grey area.
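Blocking restricts the comparisons to record pairs that share the value of the blocking variable. The sketch below shows the idea with a first pass on date of birth of the mother and a second pass on postal code. The field names (`dob_mother`, `postal_code`), the function name and the records are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, block_key):
    """Yield only those record pairs that share the same value of block_key."""
    index = defaultdict(list)
    for record in file_b:
        value = record.get(block_key)
        if value is not None:
            index[value].append(record)
    for record_a in file_a:
        value = record_a.get(block_key)
        if value is None:
            continue  # a record without a blocking value cannot be blocked
        for record_b in index.get(value, []):
            yield record_a, record_b

# Made-up records for illustration only.
perinatal = [
    {"id": "P1", "dob_mother": "1970-05-01", "postal_code": "1012"},
    {"id": "P2", "dob_mother": "1975-11-23", "postal_code": "3511"},
]
population = [
    {"id": "Q1", "dob_mother": "1970-05-01", "postal_code": "1012"},
    {"id": "Q2", "dob_mother": "1975-11-24", "postal_code": "3511"},  # date is off by one day
]

first_pass = {(a["id"], b["id"])
              for a, b in candidate_pairs(perinatal, population, "dob_mother")}
second_pass = {(a["id"], b["id"])
               for a, b in candidate_pairs(perinatal, population, "postal_code")}
all_candidates = first_pass | second_pass
print(sorted(all_candidates))
```

The second pass recovers the pair that the first pass missed because of the error in the blocking variable, which is the reason for running more than one blocking step.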

Results

Table 1 shows the mi and ui probabilities and the associated linking weights for agreement and disagreement for postal code (4 digits), date of birth child and gender. Postal code has the highest discriminating power resulting in a weight for agreement of 10.76.

Table 1.

Reliability (mi probability) and discriminating power (ui probability) with associated weights, for linking variables to link the perinatal registry with the population register for the year 2001.

| Variable | mi | ui | Agree | Disagree |
| --- | --- | --- | --- | --- |
| DoB mother | Blocking | | | |
| Postal code | 0.9592 | 0.00055 | 10.76 | −4.61 |
| DoB child | 0.9893 | 0.00278 | 8.48 | −6.54 |
| Gender child | 0.9960 | 0.50010 | 0.99 | −6.93 |

DoB = date of birth

The weight for agreement for gender is only 0.99 because of the low discriminating power of gender. The maximum linking weight is 20.23 if all variables agree. Based on the estimated prevalence of matches we set the threshold value at a total weight of 7, grey area ranging from 4 to 10. The total number of pairs we assigned as matches was 173,875 pairs.

Table 2 shows three different types of close agreement for postal code with their ui and mi probabilities. The first type of close agreement allows one of the four digits to differ (typing error). There are now three possible outcomes of variable comparison: full agreement, close agreement and disagreement. This close value shows a change in linking weight for values being close from disagreement weight of −4.61 to close agreement weight of +1.50 (increase in linking weight of +6.11). The weight for disagreement changes from −4.61 to −5.56, as disagreement is now everything besides full agreement and close agreement (decrease in linking weight of −0.95). The second type of close agreement allows only the fourth digit to differ, with the assumption that errors in the last digit are less apparent than errors in one of the other digits. This close agreement shows a change in linking weight for values being close of +7.59 and for disagreement of −0.7. The third type of close agreement allows the reversal of two digits within the four digit postal code.

Table 2.

Reliability (mi probability) and discriminating power (ui probability) of three types of close agreement for postal code.

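| Variable | Outcome | mi | ui | Agreement weight | Disagreement weight |
| --- | --- | --- | --- | --- | --- |
| Postal code (4 digits) | Full agreement | 0.95920 | 0.00055 | 10.76 | −4.61 |
| (1) Postal code, 1 of 4 digits differs | Full agreement | 0.95920 | 0.00059 | 10.66 | −5.56 |
| | Close agreement | 0.01970 | 0.00695 | 1.50* | |
| (2) Postal code, 4th digit differs | Full agreement | 0.95910 | 0.00059 | 10.66 | −5.31 |
| | Close agreement | 0.01580 | 0.00200 | 2.98* | |
| (3) Postal code, two digits reversed | Full agreement | 0.95920 | 0.00059 | 10.66 | −4.65 |
| | Close: 1st and 2nd digit reversed | 0.00009 | 0.00013 | −0.56* | |
| | Close: 2nd and 3rd digit reversed | 0.00008 | 0.00019 | −1.29* | |
| | Close: 3rd and 4th digit reversed | 0.00086 | 0.00013 | 2.76* | |

* Close agreement weights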

Reversal of the third and fourth digit shows the largest shift of +7.37 for values being close and −0.04 for disagreement. This can be explained by the fact that an error in the last two digits (indicating the neighborhood) is less apparent than an error in the first two digits (indicating the region in the Netherlands).
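The postal code comparisons discussed above can be sketched as simple string functions. The example assumes postal codes are stored as four-character strings and, following the combination used in Table 4, treats a difference in the fourth digit or a reversal of the third and fourth digit as close agreement. All function names are illustrative.

```python
def one_digit_differs(a, b):
    """Type (1): exactly one digit differs, at any position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def last_digit_differs(a, b):
    """Type (2): only the last digit differs."""
    return len(a) == len(b) and a[:-1] == b[:-1] and a[-1] != b[-1]

def adjacent_reversal(a, b, position):
    """Type (3): b equals a with the digits at position and position + 1
    swapped (0-based position)."""
    if len(a) != len(b) or a == b or position + 1 >= len(a):
        return False
    swapped = a[:position] + a[position + 1] + a[position] + a[position + 2:]
    return swapped == b

def compare_postal_code(a, b):
    """Return 'full', 'close', 'disagree', or None if a value is missing."""
    if a is None or b is None:
        return None
    if a == b:
        return "full"
    if last_digit_differs(a, b) or adjacent_reversal(a, b, 2):
        return "close"
    return "disagree"

print(compare_postal_code("1012", "1013"))  # fourth digit differs
print(compare_postal_code("1012", "1021"))  # third and fourth digit reversed
print(compare_postal_code("1012", "2012"))  # first digit differs
```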

Table 3 shows close agreement types for the variable date of birth child: a difference of 1 day, 2 days and 1 month, with the ui and mi probabilities. The close agreement of 1-day difference is the only close agreement with a high discriminating power: the weight for being close is changed by +6.94 and the punishment for disagreement by −1.57.

Table 3.

Reliability (mi probability) and discriminating power (ui probability) of three types of close agreement for date of birth child.

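| Variable | Outcome | mi | ui | Agreement weight | Disagreement weight |
| --- | --- | --- | --- | --- | --- |
| DoB child | Full agreement | 0.98930 | 0.00278 | 8.47 | −6.54 |
| (1) DoB child ± 1 day | Full agreement | 0.98920 | 0.00278 | 8.48 | −8.11 |
| | Close agreement | 0.00722 | 0.00549 | 0.40* | |
| (2) DoB child ± 2 days | Full agreement | 0.98930 | 0.00278 | 8.48 | −6.62 |
| | Close agreement | 0.00064 | 0.00541 | −3.08* | |
| (3) DoB child ± 1 month | Full agreement | 0.98930 | 0.00278 | 8.48 | −6.63 |
| | Close agreement | 0.00069 | 0.00485 | −2.82* | |

* Close agreement weights, DoB = date of birth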
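A date comparison with a one-day tolerance, the only date-based close agreement with a positive weight in Table 3, can be sketched as follows. The function name and the `tolerance_days` parameter are assumptions for illustration.

```python
from datetime import date

def compare_birth_date(a, b, tolerance_days=1):
    """Return 'full', 'close', 'disagree', or None if a date is missing."""
    if a is None or b is None:
        return None
    if a == b:
        return "full"
    if abs((a - b).days) <= tolerance_days:
        return "close"
    return "disagree"

print(compare_birth_date(date(2001, 3, 1), date(2001, 2, 28)))    # one day apart
print(compare_birth_date(date(2001, 12, 31), date(2002, 1, 1)))   # one day apart across the year boundary
print(compare_birth_date(date(2001, 3, 1), date(2001, 3, 3)))     # two days apart
```

Working with calendar differences rather than comparing day, month and year separately keeps births around midnight at a month or year boundary within the close category.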

Table 4 finally demonstrates the benefit, in terms of grey area reduction, of the best close agreement types of Tables 2 and 3. In this example we fixed the threshold at a total weight of 7, with a grey area range of 4–10. Adding close agreement for date of birth child of a 1-day difference reduces the grey area by 24%. Adding close agreement for postal code of a difference in the fourth digit or a reversal of the third and fourth digit reduces the grey area by 71%. When combined, the addition of these close agreements reduces the number of pairs in the grey area by 95%. The addition of close agreement for date of birth child and postal code lifted 1,660 links (10% of the pairs originally in the grey area) from the uncertain linked area over the upper boundary of the grey area.

Table 4.

Grey area reduction after adding close agreement for date of birth child, postal code and for date of birth and postal code for the linking of singleton files.

| Linking strategy | From uncertain not linked (weight 4–7) to certain not linked (weight <4) | % | Grey area (uncertain region), weight 4–10* | % | From uncertain linked (weight 7–10) to certain linked (weight >10) | % | Number of links (weight >7) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Basic probabilistic linkage | na | | 16,465 | 100% | na | | 173,875 |
| Close agreement for date of birth child (+/− 1 day) | 3,504 | 21% | 12,523 | 76% | 438 | 3% | 174,313 |
| Close agreement for postal code (error in 4th digit, reversal of 3rd and 4th digit) | 10,417 | 63% | 4,826 | 29% | 1,222 | 7% | 175,099 |
| Close agreement for date of birth and postal code | 13,908 | 84% | 897 | 5% | 1,660 | 10% | 175,537 |

na = not applicable

* The number of record pairs in the grey area with basic probabilistic linkage was set to 100%.
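The grey area reductions quoted in the text follow directly from the counts in Table 4, as the short calculation below shows. The dictionary keys are illustrative labels; the counts are taken from the table.

```python
BASELINE_GREY = 16465  # pairs in the grey area with basic probabilistic linkage

# Counts per strategy from Table 4: pairs moved below the grey area,
# pairs remaining in the grey area, pairs moved above the grey area.
strategies = {
    "dob_child_1_day": {"to_non_link": 3504, "still_grey": 12523, "to_link": 438},
    "postal_code": {"to_non_link": 10417, "still_grey": 4826, "to_link": 1222},
    "dob_and_postal_code": {"to_non_link": 13908, "still_grey": 897, "to_link": 1660},
}

def grey_area_reduction(still_grey, baseline=BASELINE_GREY):
    """Fraction of the original grey area that has been resolved."""
    return 1 - still_grey / baseline

for name, counts in strategies.items():
    # Every pair from the original grey area ends up in one of three groups.
    assert sum(counts.values()) == BASELINE_GREY
    print(name, f"{grey_area_reduction(counts['still_grey']):.1%}")
```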

Discussion

This paper shows, within the context of linking perinatal data, that the addition of “close agreement” as a third outcome in a probabilistic linkage algorithm substantially improves the performance. Incorporating close agreement in the linkage algorithm corrects for typical errors in linking variables. Similar improvements are likely to be expected in other studies because the frequency, nature, and type of errors in other large databases will not be substantially different.

Although prior knowledge is required to define close, the testing of the added value (if any) could be executed along existing statistical procedures. The same formal criteria could be applied to estimate the mi and ui probabilities for close agreement. The gain of close agreement could be expressed as a grey area reduction, which of course involves an arbitrary decision regarding its boundaries. The added value of incorporating close agreement will be most apparent in situations with few linking variables that are sensitive to errors.

So far, no manual validation has been conducted for this pilot study. For previously conducted linking studies using a similar approach, validation showed good results.9 The linking weights were very stable regardless of perfect or close agreement, and misclassification of links will be about the same (< 1%). Adding close agreement requires prior knowledge. For content variables used as linking variable, e.g. birth weight, this includes knowledge on habits and customs, as close agreement is not random. It is still arbitrary when to include additional types of close agreement; therefore further research will focus on deciding on optimal close values for linking variables.

The grey area reduction is a simple device to demonstrate the added value of close agreement. Application of close agreement can be anticipated in many circumstances replacing manual or semi-automatic record by record reviewing.

Conclusion

The extension of the existing dichotomy of perfect agreement/disagreement with a third category “close” and its incorporation in the record linking algorithm represents an improvement in probabilistic medical record linking, expressed in a grey area reduction of 95% in our example.

Acknowledgements

We gratefully acknowledge the support and funding of the SPRN (Foundation of the Dutch Perinatal Registry), the partners in the pilot study of Statistics Netherlands (W.P. Schaasberg, A. de Bruin and M. Bergervan Sijl) and the investment of numerous caregivers in the Netherlands.

References

  • 1. Newcombe HB. Handbook of record linkage: methods for health and statistical studies, administration and business. Oxford: Oxford University Press; 1988.
  • 2. Bell RM, Keesey J, Richards T. The urge to merge: linking vital statistics records and Medicaid claims. Med Care. 1994;32(10):1004–1018.
  • 3. Newcombe HB, Fair ME, Lalonde P. Discriminating powers of partial agreements of names for linking personal records. Part I: The logical basis. Methods Inf Med. 1989;28(2):86–91.
  • 4. Jamieson E, Roberts J, Browne G. The feasibility and accuracy of anonymized record linkage to estimate shared clientele among three health and social service agencies. Methods Inf Med. 1995;34(4):371–377.
  • 5. Roos LL, Wajda A. Record linkage strategies. Part I: Estimating information and evaluating approaches. Methods Inf Med. 1991;30(2):117–123.
  • 6. Meray N, Reitsma JB, Ravelli ACJ, Bonsel GJ. Probabilistic medical record linkage in the absence of a patient identification number - linkage of the Dutch perinatal registries. Accepted for publication in Journal of Clinical Epidemiology.
  • 7. Fellegi IP, Sunter AB. A theory for record linkage. Journal of the American Statistical Association. 1969;64(328):1183.
  • 8. Reitsma JB. Registers in Cardiovascular Epidemiology. PhD thesis. Amsterdam: Academic Medical Center, University of Amsterdam; 1999.
  • 9. Bonsel GJ, Ravelli ACJ, Reitsma JB, Meray N. Validation linking procedure PRN 2001. Empirical validation of midwives registry (LVR1) and obstetricians registry (LVR2) linkage (in Dutch). KIK Technical Report 2004-01. 2004. http://kik.amc.uva.nl/KIK.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
