Use of Commercial Record Linkage Software and Vital Statistics to Identify Patient Deaths

Thomas B Newman; Andrew N Brown

doi:10.1136/jamia.1997.0040233

1997 May-Jun;4(3):233–237. doi: 10.1136/jamia.1997.0040233

PMCID: PMC61238 PMID: 9147342

Abstract

We evaluated the ability of a microcomputer program (Automatch) to link patient records in our hospital's database (N = 253,836) with mortality files from California (N = 1,312,779) and the U.S. Social Security Administration (N = 13,341,581). We linked 96.5% of 3,448 in-hospital deaths, 99.3% for patients with social security numbers. None of 14,073 patients known to be alive (because they were subsequently admitted) was linked with California deaths, and only 6 (0.1%) of 6,444 were falsely identified as dead in the United States file. For patients with unknown vital status but items in the database likely to be associated with high 3-year mortality rates, we identified death records of 88% of 494 patients with cancer metastatic to the liver, 84% of 164 patients with pancreatic cancer, and 91% of 126 patients with CD4 counts of less than 50. Hospital data can be accurately linked with state and national vital statistics using commercial record linkage software.

Large administrative databases are increasingly being used to compare mortality across hospitals. However, one limitation of these databases is that they include only in-patient mortality. Information on out-of-hospital deaths can be obtained on a patient-by-patient basis from centralized death certificate registries, such as the National Death Index or the Canadian Mortality Database. However, the turn-around time and charge per record linked make such a procedure unfeasible for hospitals or managed care organizations that wish to link all of their records and create their own databases for research or outcome tracking. In addition, the National Death Index and Social Security Administration records do not provide information on the cause of death; to obtain the cause of death, death certificates must be requested from the states. Because commercial record linkage software and computerized death certificates are now available at relatively low cost (a few thousand dollars total for both), it is becoming increasingly feasible to link hospital and vital statistics records on site. This paper describes our use of a commercial record linkage program to link a large (N = 253,836 patients) clinical database with mortality data from the state of California and the Social Security Administration.

Methods

Data Sources

We wished to link two types of files: a PATIENT file and DEATH files. The patient file came from our previously created database¹; the original source of the information we used for matching was the patient registration system. The death files were obtained from the California Department of Health Services (Deaths of California residents, 1988-93) and from the United States Social Security Administration (United States Deaths: January 1988-May 1995) via a private vendor (CSRA, Inc., Irvine, CA). Variables available for linkage were last name, first name, date of birth, and social security number for the Social Security Administration data, and these variables plus middle initial, race, county of residence, and sex for the California data.

Linkage Algorithm

We used Automatch (Matchware Technologies, Inc., Silver Spring, MD) running on a Pentium microcomputer for the record linkage. The algorithm used by Automatch has been described previously.² The process involves a sequence of user-defined linkage runs; records remaining unlinked from run 1 were available for linkage in run 2, and so on. For each run, the user first defines blocking variables, which must match exactly in the two files. Automatch sorts each input file by the blocking variables, and then compares records within each block, based upon values of the other (matching) variables, which need not match exactly. For example, for our first run to match California deaths, we used social security number as a blocking variable, and last name, first name, middle initial, date of birth, sex, race, and county of residence as matching variables.
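The blocking step described above can be sketched in a few lines of Python. This is only an illustration, not Automatch's implementation: it groups records with a dictionary rather than sorting the files, and the record layout, the `candidate_pairs` function, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict


def candidate_pairs(patients, deaths, block_key):
    """Yield (patient, death) pairs that agree exactly on the blocking key.

    Hypothetical helper: a dictionary lookup stands in for the sort-and-scan
    over blocks that the text describes.
    """
    blocks = defaultdict(list)
    for death in deaths:
        key = block_key(death)
        if key is not None:  # a record missing the blocking value joins no block
            blocks[key].append(death)
    for patient in patients:
        key = block_key(patient)
        if key is None:
            continue
        for death in blocks.get(key, []):
            yield patient, death


# Made-up records; the SSNs are deliberately invalid placeholders.
patients = [
    {"id": "P1", "ssn": "000-00-0001", "last": "FLORES"},
    {"id": "P2", "ssn": None, "last": "WONG"},
]
deaths = [
    {"id": "D1", "ssn": "000-00-0001", "last": "FLORES"},
    {"id": "D2", "ssn": "000-00-0002", "last": "WONG"},
]

pairs = [(p["id"], d["id"])
         for p, d in candidate_pairs(patients, deaths, lambda r: r["ssn"])]
print(pairs)
```

Only pairs inside the same block are ever compared on the matching variables, which is what keeps the number of comparisons manageable for files of this size.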

Within each block, each pair of records is assigned a weight, corresponding to how closely the matching variables match. The weight is calculated as Σ log₂(m/u), where m is the probability that two truly matched records would show the observed degree of matching, u is the probability that two truly unmatched records would show the observed degree of matching, and the summation is over all the matching variables. This is best illustrated by an example. We found that if two records were truly matched, the proportion that matched exactly for sex was m = 0.99. On the other hand, the probability that two truly unmatched records would match for sex (i.e., the probability of sex being the same in any two records by chance alone) is u = 0.50. Therefore, the weight assigned to sex when it matches is log₂(0.99/0.5) = +0.99, and the weight assigned to sex when it does not match is log₂((1 - 0.99)/0.5) = -5.64. Since chance agreement on last name is much less likely than chance agreement on sex, the weight assigned to matching last names was much higher. In addition, since chance agreement on uncommon names is less likely than chance agreement on common names, Automatch assigns a higher weight to matching uncommon names. Thus, for example, the weight was +16.85 for uncommon last names, compared with +6.92 for last name “Smith” and +9.68 for last name “Wong.” For each pair of records, the weights for each variable are summed to arrive at an overall weight for that pair of records.
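The arithmetic in the sex example can be checked with a short Python sketch. The function names are hypothetical. The disagreement weight is written in the usual probabilistic-linkage form log₂((1 - m)/(1 - u)), which gives the figure quoted in the text because 1 - u = 0.5 for sex. The m and u values for birth date are invented purely for illustration.

```python
import math


def agreement_weight(m, u):
    """Weight added when a matching variable agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (a negative number) when a matching variable disagrees."""
    return math.log2((1 - m) / (1 - u))


def pair_weight(agreements, probabilities):
    """Sum the per-variable weights for one candidate pair of records."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probabilities[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


probabilities = {
    "sex": (0.99, 0.50),          # m and u as reported in the text
    "birth_date": (0.95, 0.001),  # hypothetical values, not from the paper
}

print(round(agreement_weight(0.99, 0.50), 2))     # 0.99
print(round(disagreement_weight(0.99, 0.50), 2))  # -5.64
print(pair_weight({"sex": True, "birth_date": True}, probabilities))
```

A variable with a small u, such as an uncommon last name, contributes a large positive weight when it agrees, which is why agreement on a rare name counts for more than agreement on sex.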

Users are required to specify two thresholds for the weights produced by the linkage program. Pairs with weights below the clerical threshold are assumed not to be matched and remain to be matched in subsequent runs with different choices for matching and blocking variables. Pairs with weights above the match threshold are considered matched. Between these thresholds, we inspected the records manually and decided whether to match them. All of these determinations were made blind to actual vital status. In selecting thresholds and determining clerical matches, we aimed to err on the side of not linking unless there was convincing evidence of a true match, so that the primary effect of missing data would be lack of sensitivity rather than lack of specificity. Thus, in adjudicating clerical matches, we were looking primarily for similarities apparent to us that would not have been taken into account by Automatch, such as first name “MARY-JANE” and middle initial “ ” (missing) in one record, and first name “MARY” and middle initial “J” in the other.
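The two-threshold rule can be written as a small Python function. This is a sketch only: the threshold values below are hypothetical (the paper does not report the ones used), and treating a weight exactly equal to a threshold as a clerical case is an assumption.

```python
def classify(weight, clerical_threshold, match_threshold):
    """Assign a candidate pair to one of three outcomes by its total weight."""
    if clerical_threshold > match_threshold:
        raise ValueError("clerical threshold must not exceed match threshold")
    if weight > match_threshold:
        return "match"
    if weight < clerical_threshold:
        return "non-match"  # stays available for later runs
    return "clerical review"  # boundary values land here by assumption


# Hypothetical thresholds, chosen only to exercise the three branches.
CLERICAL, MATCH = 10.0, 20.0
for w in (25.3, 14.1, -2.0):
    print(w, classify(w, CLERICAL, MATCH))
```

Raising the match threshold moves more pairs into clerical review or non-match, which trades sensitivity for specificity in the way the text describes.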

To minimize run time and avoid false-positive matches, we avoided looking for deaths on patients whose vital status was already known. Thus, we did not look for deaths in years when patients were known to be alive, nor after years when patients were already known to be dead. The exception is samples of patients known to be alive that we submitted to estimate the specificity of the linkage algorithm (see below).

The matching sequence we used is summarized in Table 1. The first run with California data and the only run with United States data were blocked on social security number; 82% of all deaths we eventually identified had exactly matching social security numbers. The second blocking variable was the soundex of the last name.³ Soundex is a simple algorithm to deal with alternative spellings and misspellings by removing vowels and collapsing consonants into six related groups. Subsequent runs were blocked on combinations of soundex of the last name and other variables.
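For readers unfamiliar with Soundex, the following Python sketch implements the common four-character American variant (first letter plus three digits, with H and W ignored between consonants). It is offered as an illustration of the idea; the exact variant Automatch used is not stated in the text, so this should not be read as its implementation.

```python
# Six consonant groups; vowels, H, W, and Y carry no code.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return "".join(encoded)[:4].ljust(4, "0")


for surname in ("Smith", "Smyth", "Robert", "Rupert"):
    print(surname, soundex(surname))
```

Because "Smith" and "Smyth" receive the same code, a run blocked on Soundex of the last name can still bring the two spellings into the same block.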

Table 1.

Matching Run Sequence

Run No.	Blocking Variable	Number of New Matches	New Matches as Percent of All Matches (%)
CA-1	Social security number	16,495	75.5
CA-2	Soundex (last name), birth date	3,430	15.7
CA-3	Soundex (last name), sex	376	1.7
CA-4	Birth date, sex	43	0.2
CA-5	Birth date, sex	9	0.04
US	Social security number	1,489	6.8
Total		21,842	100.00

For each run, records matched in previous runs were excluded. Variables not used for blocking were used as matching variables. Run CA-4 differs from CA-5 only in that the “prefix” type of variable was used for names (e.g., Flores-Wilson would match with Flores). For California deaths, variables available for linkage were social security number, last name, first name, middle initial, date of birth, sex, race, and county of residence. For U.S. deaths, only social security number, last name, first name, and date of birth were available.
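The run sequence in Table 1, in which records linked in one run are withheld from later runs, might be sketched as follows. Everything here is hypothetical: the `link_in_passes` function, the record layout, and the `is_match` rule, which simply demands agreement on last name and birth date in place of a weight compared against a threshold. The second pass blocks on the exact last name rather than its Soundex code to keep the example short.

```python
from collections import defaultdict


def link_in_passes(patients, deaths, passes, is_match):
    """Apply blocking passes in order; linked records skip all later passes."""
    links = {}          # patient id -> death id
    new_per_pass = []
    for block_key in passes:
        taken = set(links.values())
        blocks = defaultdict(list)
        for d in deaths:
            key = block_key(d)
            if d["id"] not in taken and key is not None:
                blocks[key].append(d)
        new = 0
        for p in patients:
            key = block_key(p)
            if p["id"] in links or key is None:
                continue
            for d in blocks.get(key, []):
                if d["id"] not in taken and is_match(p, d):
                    links[p["id"]] = d["id"]
                    taken.add(d["id"])
                    new += 1
                    break
        new_per_pass.append(new)
    return links, new_per_pass


# Made-up records; the SSN is a deliberately invalid placeholder.
patients = [
    {"id": "P1", "ssn": "000-00-0001", "last": "FLORES", "dob": "1940-01-02"},
    {"id": "P2", "ssn": None, "last": "WONG", "dob": "1950-02-03"},
]
deaths = [
    {"id": "D1", "ssn": "000-00-0001", "last": "FLORES", "dob": "1940-01-02"},
    {"id": "D2", "ssn": None, "last": "WONG", "dob": "1950-02-03"},
]
passes = [
    lambda r: r["ssn"],               # run 1: block on SSN
    lambda r: (r["last"], r["dob"]),  # run 2: block on last name and birth date
]


def is_match(p, d):
    # Stand-in for "total weight above the match threshold".
    return p["last"] == d["last"] and p["dob"] == d["dob"]


links, new_per_pass = link_in_passes(patients, deaths, passes, is_match)
print(links, new_per_pass)
```

The per-pass counts play the role of the "Number of New Matches" column: each later pass can only add links among records that earlier passes left unlinked.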

Linkage Validation

Our first task was an evaluation of Automatch for linking patients with known vital status. Sensitivity was estimated from patients who died in the hospital (N = 3,448). Specificity was estimated separately for California and United States linkage runs. For the California linkage runs, 14,073 patients known to have been admitted in the quarter following the year for which deaths were being matched were submitted for linkage. For the United States runs, we looked at deaths before January 1, 1994, of the 6,444 patients with social security numbers who were admitted to the UCSF Medical Center during the first half of 1994.

Our second task was an evaluation of computerized death registries as a method of ascertaining mortality. We used three additional groups of patients identified from other data in our database for whom we postulated a priori a very high mortality risk: patients with cancer metastatic to liver before 1991 (ICD9 = 197.7; N = 494); patients with pancreatic cancer before 1991 (ICD9 = 157.X; N = 164); and patients with CD4 counts less than 50 from August 1, 1991, to December 31, 1991 (the earliest available in the database; N = 126). We excluded from each of these groups of patients those patients with in-hospital deaths (known vital status) and 11 known to be alive after the end of 1993 because of admissions in 1994 or 1995. Patients with diagnoses of both pancreatic cancer and cancer metastatic to liver were placed in the pancreatic cancer group and excluded from the liver metastases group. We then determined the number of deaths identified through 1993 (from both California and United States sources) and after 1993 (from United States data only); the latter served as an additional indicator of the validity of the assumption of a 3-year mortality rate of close to 100%.

Statistical Analysis

We used STATA 4.0 (Stata Corporation, College Station, TX) to calculate binomial exact 95% confidence intervals for proportions. We used Epi-Info 5.0 to calculate the Mantel-Haenszel relative risk and its confidence intervals. We used Microsoft Excel 5.0 for all other calculations.
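A binomial exact (Clopper-Pearson) interval of the kind mentioned above can be computed with the Python standard library alone. The sketch below finds each bound by bisection on the binomial tail probability. It is an independent illustration, and the function names are hypothetical. It is not guaranteed to reproduce STATA 4.0's output to the last digit, since packages differ in how they handle a count of zero and in how they round.

```python
import math


def _log_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def _cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return min(1.0, sum(math.exp(_log_pmf(i, n, p)) for i in range(k + 1)))


def _solve(f, target):
    """Find p in [0, 1] with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(x, n, level=0.95):
    """Two-sided Clopper-Pearson interval for x successes in n trials."""
    alpha = 1.0 - level
    lower = 0.0 if x == 0 else _solve(lambda p: _cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else _solve(lambda p: _cdf(x, n, p), alpha / 2)
    return lower, upper


# Proportion falsely linked in the United States file: 6 of 6,444.
low, high = exact_binomial_ci(6, 6444)
print(f"false-link proportion: {low:.5f} to {high:.5f}")
print(f"specificity: {100 * (1 - high):.2f}% to {100 * (1 - low):.2f}%")
```

An interval for specificity follows from the interval for the false-link proportion by subtracting each bound from 1 and swapping the ends.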

Results

Of 253,836 patients, we linked 21,842 (8.6%) with records from the death files: 20,353 from the California death tapes (93% of deaths linked), and 1,489 additional deaths from the United States file (7% of deaths linked; see Table 1).

Table 2 shows the linkage results for the 3,448 in-hospital deaths. Sensitivity was close to 99% in all groups when social security number was available, compared with 86.6% when it was missing. Thus, most of the 122 in-hospital deaths that were not linked were missing social security number (N = 102; 84%); 45 (37%) were missing the first name as well. About two-thirds of the in-hospital deaths not linked (N = 80; 66%) were of infants less than 1 year old; most of these infants were less than 1 month old at the time of death.

Table 2.

Sensitivity of Record Linkage for In-hospital Deaths by Presence or Absence of Social Security Number (SSN) and by Demographic Variables

	SSN Present			SSN Missing
	No. of deaths	No. identified	%	No. of deaths	No. identified	%
Males	1536	1528	99.5	412	357	86.7
Females	1151	1139	99.0	349	302	86.5
White	1624	1612	99.3	293	268	91.5
Nonwhite	884	878	99.3	289	240	83.0
Unknown race	179	177	98.9	179	151	84.4
<1 year	23	23	100.0	469	389	82.9
>=1 year	2663	2643	99.2	292	270	92.5
First name missing	0	0		158	113	71.5
First name not missing	2687	2667	99.3	603	546	90.5
Total	2687	2667	99.3	761	659	86.6
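The summary percentages quoted in the text follow directly from the totals row of Table 2. A short Python check, using only counts that appear in the table (the variable names are ours):

```python
# Totals row of Table 2: (in-hospital deaths, deaths identified by linkage).
ssn_present = (2687, 2667)
ssn_missing = (761, 659)


def sensitivity(deaths, identified):
    """Percentage of known deaths that the linkage identified."""
    return 100.0 * identified / deaths


all_deaths = ssn_present[0] + ssn_missing[0]
all_identified = ssn_present[1] + ssn_missing[1]

print(round(sensitivity(*ssn_present), 1))                # SSN present
print(round(sensitivity(*ssn_missing), 1))                # SSN missing
print(round(sensitivity(all_deaths, all_identified), 1))  # overall
print(all_deaths - all_identified)                        # deaths not linked
```

The same subtraction within the SSN-missing column gives the 102 unlinked deaths without a social security number.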

The California data, which could be matched on fields other than social security number (SSN), gave us the opportunity to look at the accuracy of SSNs. Of the deaths identified from California tapes, 81% had exactly matching SSNs, 12.2% had one or both SSNs missing, 4.3% had clearly similar SSNs, and 2.4% had very different SSNs. To be certain that linked records with very different SSNs had not been falsely linked, we compared the dates of death for in-patient deaths. Of 81 in-patient deaths linked with California death records in spite of very different SSNs, 79 (98%) had exactly matching death dates; death dates of the other 2 (2%) differed by a single day. Thus, we are confident that the 2.4% of deaths linked with very different SSNs were not falsely linked.

Specificity for the record linkage was nearly perfect (Table 3). None of the 14,073 patients known to be alive who were submitted for linkage with California data was falsely linked, and only 6 of 6,444 such patients were falsely linked to the United States data. Examining these six falsely linked records indicated that the problem lay not with the linkage algorithm but with the data source: name, social security number, and birth date for all six patients were identical in our patient file and in the United States death file.

The estimated sensitivity for deaths outside UCSF in patients with high mortality conditions was, as expected, a bit lower than that for in-hospital deaths (Table 4). The proportion of deaths identified through the end of 1993 was 81-86%. Adding deaths from 1994 and early 1995, which were available only from the United States data, increased the proportions of deaths identified, primarily for patients with low CD4 counts; this suggests that our assumption of a 2-year mortality rate of close to 100% for patients with low CD4 counts was excessively pessimistic. The fact that many patients with low CD4 counts were admitted in 1994 provides further evidence that their 2-year mortality rate was well under 100%.

Table 3.

Specificity of the Record Linkage

Data Source for Deaths	Patients Known Alive	Number Linked	Specificity (%)	95% CI (%)
California	14,073	0	100.00	99.98-100.00
United States	6,444	6	99.90	99.79-99.97

Table 4.

Patient Records Linked to State and National Mortality Files

High-Risk Group	No. at Risk	No. Deaths Identified Through 1993	Sensitivity Deaths Through 1993 (%)	No. additional US Deaths after 1993	Sensitivity Including Late Deaths (%)	95% CI for Final Sensitivity (%)
Liver metastases	494	425	86.0	9	87.9	84.6-90.6
Pancreatic cancer	164	138	84.1	0	84.1	77.6-89.4
CD4 < 50	126	102	80.9	13	91.3	84.9-95.6
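The final sensitivities in Table 4, and the pooled figure across the three groups, can be recomputed from the counts in the table. A minimal Python sketch (the dictionary layout is ours):

```python
# Table 4 counts: (patients at risk, deaths through 1993, additional US deaths).
groups = {
    "Liver metastases": (494, 425, 9),
    "Pancreatic cancer": (164, 138, 0),
    "CD4 < 50": (126, 102, 13),
}

final = {}
for name, (at_risk, through_1993, later) in groups.items():
    final[name] = 100.0 * (through_1993 + later) / at_risk
    print(f"{name}: {final[name]:.1f}%")

total_at_risk = sum(g[0] for g in groups.values())
total_deaths = sum(g[1] + g[2] for g in groups.values())
pooled = 100.0 * total_deaths / total_at_risk
print(f"pooled: {total_deaths} of {total_at_risk} = {pooled:.0f}%")
```

Pooling the groups gives 687 deaths identified among 784 patients, or about 88%.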

Discussion

The major determinants for successful linkage are an accurate record linkage algorithm and complete and sufficiently accurate data for the items to be linked. The present study demonstrates that patient records can be linked to computerized death certificates using a commercial microcomputer record linkage program. We found a very high sensitivity (99.3%) for in-patient deaths among the 78% of decedents with social security numbers. Sensitivity was lower in those missing social security numbers, especially those also missing first names, most of whom were infants. Linkage is difficult for infants because of these missing data and because their last name may change from the mother's to the father's between birth and death. As we had intended when we set a high threshold for matching, the specificity for the record linkage was nearly perfect.

We identified deaths occurring outside our medical center in 88% of patients in three high-risk groups. Because both deaths and hospital admissions occurred in these groups more than 3 years after the data putting them at high risk were recorded, we know that our estimate of a 3-year mortality rate of close to 100% was too pessimistic. We nonetheless report on these three groups because we had selected them for a study a priori, and we feel that results obtained by dredging the database for higher mortality conditions might be misleading. Thus, it is likely that the true sensitivity of our linkage program is a little better than what we measured in this study, even including United States deaths from 1994 to early 1995, because some of the patients we failed to link may yet be alive.

This record linkage will be of greatest interest if it is replicated elsewhere. This would allow comparisons of outcomes of care to extend to events beyond the hospital period. An important prerequisite for such interhospital comparisons will be validation of the record linkage procedure, so that differences in the quality of identifying data or in the quality of record linkage are not mistaken for differences in mortality. The procedures presented in this paper provide an example of how such a validation could be done.

Acknowledgments

Data on California deaths were obtained from the Health Data and Statistics Branch of the California Department of Health Services. All analyses, interpretations, and conclusions are those of the authors, and not the Health Data and Statistics Branch. Glen Stettin, Andrew Bindman, Jonathan Showstack, Lewis Sheiner, and Janet Easterling provided helpful suggestions.

This work was supported by the Medical Center of the University of California at San Francisco. It was presented May 2, 1996, at the 19th Annual Meeting of the Society of General Internal Medicine, Washington, DC. There are no financial or commercial relationships that might pose a conflict of interest.

References

1. Newman TB, Brown A, Easterling MJ. Obstacles and approaches to clinical database research: experience at the University of California, San Francisco. Proc Annu Symp Comput Appl Med Care. 1994:568-72.
2. Jaro M. Probabilistic linkage of large public health data files. Statistics in Medicine. 1995;14:491-8.
3. Knuth D. The Art of Computer Programming, vol 3. Reading, MA: Addison-Wesley, 1973.
