Journal of Registry Management. 2023 Dec 1;50(4):138–143.

Using LexisNexis to Improve Social Security Number Information in the New York State Cancer Registry

Baozhen Qiao, April A Austin, Jamie Musco, Tabassum Insaf, Maria J Schymura
PMCID: PMC10945922  PMID: 38504707

Abstract

Background:

Social Security numbers (SSNs) collected by cancer surveillance registries in the United States are used for patient matching, deduplication, follow-up, and linkage studies. However, due to various reasons, a small proportion of patient records have missing or inaccurate SSNs. Recently, New York State Cancer Registry (NYSCR) data have been linked to LexisNexis data to obtain patient demographic information, including SSNs. The current study evaluated the feasibility of using LexisNexis to improve SSN information in the NYSCR.

Materials and Methods:

Patients diagnosed during the years 2005–2016, aged 21 or older, in the NYSCR were linked to LexisNexis data. For the matched patients, LexisNexis returned demographic information, including SSNs as available. Percentages of patients without LexisNexis matches or without LexisNexis SSNs were examined by demographic characteristics. We used multivariate logistic regression analyses to further evaluate how patient demographic characteristics affected the likelihood of no LexisNexis matches or of no SSNs returned. For patients with SSNs returned, LexisNexis SSNs were compared with registry SSNs. If patients had prior missing registry SSNs or if LexisNexis SSNs were inconsistent with registry SSNs, we used Match*Pro to review and verify match status. Registry SSNs were updated for those confirmed to be true matches. Improvement of SSNs was assessed based on percentage reduction of missingness.

Results:

Of 1,396,078 patient records submitted for LexisNexis linkage, 1.6% were not matched. Among those matched, 1.5% did not have SSNs returned. Multivariate logistic regression analyses indicated that patients who were female, Black, Asian/Pacific Islander (API), Hispanic, born outside the United States, deceased, or living in poorer census tracts were more likely to have no LexisNexis match or no SSN returned. Among 47,271 patients with missing registry SSNs (3.4%), 26,895 had SSNs returned from LexisNexis, and 24,919 were confirmed to be true matches. After registry SSN updates, the percentage of SSN missingness was reduced to 1.7%, with a larger absolute reduction observed among those who were younger than 60 years, API, or alive. Among 33,057 patients with inconsistent SSNs, 11,474 inconsistencies were due to incorrect consolidation of SSNs in the registry, and those SSNs were subsequently fixed.

Conclusions:

LexisNexis is a valuable resource for improving the quality of SSN information in registries. Our results showed that the overall percentage of patients with missing SSNs was reduced from 3.4% to 1.7% after LexisNexis linkage, and SSNs that had been incorrectly consolidated for some patients were identified and fixed. However, the magnitude of SSN improvement varied by patient demographic characteristics. Data quality improvements often require resources, and this evaluation can assist registries with decisions related to similar efforts.

Keywords: LexisNexis, Social Security number

Introduction

Population-based central cancer registries in the United States collect data on patient demographics, cancer diagnosis, staging, treatment, and follow-up information for cancer patients diagnosed in their catchment areas.1 Social Security number (SSN) is a standard data item that has been routinely collected. SSN is an important data element that is used for patient matching, deduplication, follow-up, and linkage studies.2,3 However, a small proportion of patient records have missing or inaccurate SSNs in registries.

The NYSCR, funded by the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR) since 1995 and by the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program since 2018, is one of the largest registries in the nation, collecting data on more than 120,000 newly diagnosed cancer cases each year. As one of the SEER registries, the NYSCR recently had the opportunity to participate in linkages of registry and LexisNexis data. LexisNexis is a commercial database containing public and proprietary information for over 276 million individuals in the United States.4 Even though the NYSCR had previously used LexisNexis batch searches to obtain or verify birth date, SSN, and address for patients with missing, incomplete, or conflicting information, those linkages included limited patient records.5 For example, in Pradhan and Boscoe's study,5 only 5,958 patients diagnosed during 2003–2010 (representing 0.7% of all cases diagnosed in that time period) were selected for assessment of SSN improvement using LexisNexis. However, this new SEER-sponsored large-scale linkage allowed us to systematically evaluate the usefulness of LexisNexis for improving data quality on demographic information of cancer patients. The purpose of the current study was to evaluate the feasibility of using LexisNexis to improve SSN information in the NYSCR.

Materials and Methods

A total of 1,396,078 cancer patients diagnosed during 2005–2016 at age 21 years or older in the NYSCR were submitted for LexisNexis linkages. For the matched patients, LexisNexis returned first name, last name, middle name, birth date, SSN, up to 3 phone numbers, and 20 addresses, as available.

We first examined the patients who had missing registry SSNs prior to LexisNexis linkages by the following patient demographic characteristics: sex (male or female), age at linkage (<60, 60–<70, 70–<80, 80–<90, or ≥90 years), race (White, Black, American Indian/Alaska Native, Asian/Pacific Islander [API], or unknown), ethnicity (non-Hispanic or Hispanic), birthplace (United States, non–United States, or unknown), census tract poverty level (assigned based on address at cancer diagnosis: 0%–<5%, 5%–<10%, 10%–<20%, 20%–100%, or unknown), and vital status (deceased or alive). Then, based on the linkage results, we calculated, by patient demographic characteristics, the percentage of patients who had no LexisNexis matches and the percentage who had LexisNexis matches but no SSNs returned. We used multivariate logistic regression analyses to further evaluate how patient demographic characteristics affected the likelihood of no LexisNexis matches or of no SSNs returned.

For patients with SSNs returned, we compared LexisNexis SSNs with registry SSNs to determine their consistency. If patients had prior missing registry SSNs or the returned LexisNexis SSNs were different from the registry SSNs, patients' names, birth dates, phone numbers, and addresses were further compared using Match*Pro software6 to verify match status. Based on the similarity scores of the data fields in comparison, we determined whether manual review was needed (Figure 1). If the SSNs returned from LexisNexis were different from registry SSNs (consolidated values), registry source level SSNs were reviewed to determine whether there were any consolidation issues. Registry SSNs were updated for those confirmed to be true matches. Improvement of registry SSNs was assessed using absolute and relative reductions in percentage missing SSN overall and by demographic characteristics.
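The review logic described above can be sketched as follows. This is only an illustration of score-based triage, not Match*Pro's scoring: the field names, the use of Python's `difflib.SequenceMatcher` as the similarity measure, the averaging of field scores, and both thresholds are assumptions made for the example, and the study's actual cutoffs are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the study's actual cutoffs are not reproduced here.
AUTO_ACCEPT = 0.95   # mean similarity at or above this: accept without review
MANUAL_FLOOR = 0.60  # mean similarity below this: treat as unverified

# Assumed field names for the comparison.
FIELDS = ("first_name", "last_name", "birth_date", "address")


def field_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity between two field values (case-insensitive)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def triage(registry: dict, lexisnexis: dict) -> str:
    """Classify a candidate pair as 'auto_match', 'manual_review', or 'unverified'."""
    scores = [
        field_similarity(registry[f], lexisnexis[f])
        for f in FIELDS
        if registry.get(f) and lexisnexis.get(f)  # skip fields missing on either side
    ]
    if not scores:
        return "unverified"
    mean_score = sum(scores) / len(scores)
    if mean_score >= AUTO_ACCEPT:
        return "auto_match"
    if mean_score >= MANUAL_FLOOR:
        return "manual_review"
    return "unverified"
```

Raising or lowering `AUTO_ACCEPT` in a sketch like this trades manual review volume against the risk of accepting a wrong match.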

Figure 1.

Steps for Evaluation of Social Security Numbers (SSNs) Returned from LexisNexis (LN) Linkage

Results

The detailed steps taken for this evaluation (both automated and manual effort) are illustrated in Figure 1. Of 1,396,078 patient records submitted for LexisNexis linkages, 22,810 (1.6%) were not matched. Among 1,373,268 (98.4%) with matches, 1.5% had no SSNs returned. Demographic characteristics of patients without LexisNexis matches or with LexisNexis matches but without SSNs are shown in Table 1. Notably, percentages of patients who had no LexisNexis matches were higher among Black (3.1%), API (7.8%), and Hispanic (4.9%) individuals, as well as those born outside the United States (5.8%). Among those with LexisNexis matches, patients who were API (6.0%), Hispanic (4.1%), or born outside the United States (5.0%) also had higher percentages of no LexisNexis SSNs.

Table 1.

Characteristics of Patients Without LexisNexis Matches or With LexisNexis Matches but Without LexisNexis SSNs, and Odds Ratios With 95% CIs from Multivariate Logistic Regression Analyses

| Demographic characteristics | No LexisNexis match, n | % | Adjusted OR (95% CI) | LexisNexis match but no SSN, n | % | Adjusted OR (95% CI) |
|---|---|---|---|---|---|---|
| Total | 22,810 | 1.6 | NA | 20,662 | 1.5 | NA |
| Sex a | | | | | | |
| Male | 10,311 | 1.5 | Reference | 8,725 | 1.3 | Reference |
| Female | 12,493 | 1.7 | 1.05 (1.03–1.08) | 11,929 | 1.7 | 1.22 (1.19–1.26) |
| Age at LexisNexis linkage (y) | | | | | | |
| <60 | 4,705 | 1.9 | Reference | 3,914 | 1.6 | Reference |
| 60–<70 | 4,904 | 1.7 | 1.04 (1.00–1.08) | 3,236 | 1.1 | 0.79 (0.75–0.83) |
| 70–<80 | 5,532 | 1.5 | 1.11 (1.07–1.16) | 4,133 | 1.2 | 0.91 (0.87–0.96) |
| 80–<90 | 4,806 | 1.7 | 1.47 (1.40–1.53) | 4,811 | 1.7 | 1.46 (1.39–1.53) |
| ≥90 | 2,863 | 1.3 | 1.59 (1.51–1.68) | 4,568 | 2.1 | 2.13 (2.02–2.24) |
| Race | | | | | | |
| White | 10,370 | 0.9 | Reference | 12,525 | 1.1 | Reference |
| Black | 6,100 | 3.1 | 2.34 (2.26–2.42) | 3,976 | 2.1 | 1.25 (1.21–1.30) |
| American Indian/Alaska Native | 15 | 0.7 | 0.86 (0.51–1.43) | 16 | 0.7 | 0.73 (0.45–1.20) |
| Asian and Pacific Islander | 5,111 | 7.8 | 3.58 (3.44–3.73) | 3,628 | 6.0 | 2.45 (2.34–2.56) |
| Unknown | 1,214 | 10.1 | 7.28 (6.81–7.79) | 517 | 4.8 | 3.10 (2.82–3.40) |
| Ethnicity | | | | | | |
| Non-Hispanic | 17,007 | 1.3 | Reference | 15,622 | 1.2 | Reference |
| Hispanic | 5,803 | 4.9 | 1.70 (1.64–1.76) | 5,040 | 4.1 | 1.55 (1.49–1.61) |
| Birthplace | | | | | | |
| United States | 3,377 | 0.4 | Reference | 4,936 | 0.6 | Reference |
| Outside the United States | 13,659 | 5.8 | 8.00 (7.67–8.35) | 11,166 | 5.0 | 5.42 (5.22–5.64) |
| Unknown | 5,774 | 1.7 | 2.36 (2.25–2.47) | 4,560 | 1.3 | 1.95 (1.87–2.04) |
| Census tract poverty level (%) | | | | | | |
| 0–<5 | 2,657 | 0.7 | Reference | 2,123 | 0.6 | Reference |
| 5–<10 | 3,931 | 1.1 | 1.29 (1.22–1.35) | 3,483 | 1.0 | 1.53 (1.45–1.61) |
| 10–<20 | 7,114 | 1.9 | 1.75 (1.68–1.84) | 6,402 | 1.7 | 2.36 (2.25–2.48) |
| 20–100 | 8,808 | 3.1 | 1.96 (1.87–2.05) | 8,598 | 3.1 | 3.26 (3.10–3.43) |
| Unknown | 300 | 7.8 | 12.59 (11.02–14.39) | 56 | 1.6 | 3.30 (2.52–4.33) |
| Vital status | | | | | | |
| Deceased | 7,995 | 1.2 | Reference | 10,212 | 1.5 | Reference |
| Alive | 14,815 | 2.1 | 0.55 (0.53–0.57) | 10,450 | 1.5 | 0.84 (0.81–0.87) |

NA, not applicable; OR, odds ratio; SSN, Social Security number.

a Patients with unknown sex are not shown in the table.

Multivariate logistic regression analyses indicated that patients who were female, Black, API, Hispanic, born outside the United States, deceased, or living in poorer census tracts were more likely to have no LexisNexis match and, when matched, to have no SSN returned (Table 1). Compared to patients younger than 60 years, patients aged 60–<80 years were more likely to have no LexisNexis matches and less likely to have no LexisNexis SSNs returned when matches were found. Patients aged ≥80 years had an increased likelihood of both no LexisNexis matches and no LexisNexis SSNs. Patients with unknown race, birthplace, or poverty level were also more likely to have no LexisNexis matches and no LexisNexis SSNs returned.
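For readers less familiar with odds ratios, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% confidence interval from a 2x2 table, using the unmatched counts by sex from Table 1 and the numbers of patients submitted by sex from Table 2. It is not the authors' model: the ORs in Table 1 are adjusted for the other covariates in a multivariate logistic regression, so a crude estimate is expected to differ. The function name is hypothetical.

```python
import math


def crude_odds_ratio(events_exposed, total_exposed, events_ref, total_ref, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval from a 2x2 table."""
    a = events_exposed                  # exposed group, with the outcome
    b = total_exposed - events_exposed  # exposed group, without the outcome
    c = events_ref                      # reference group, with the outcome
    d = total_ref - events_ref          # reference group, without the outcome
    odds_ratio = (a / b) / (c / d)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Unmatched patients (Table 1) and patients submitted (Table 2), female vs male.
or_female, lo, hi = crude_odds_ratio(12_493, 730_538, 10_311, 665_376)
print(f"Crude OR, female vs male: {or_female:.2f} ({lo:.2f}-{hi:.2f})")
```

The crude estimate comes out near 1.1, somewhat above the adjusted value of 1.05 reported in Table 1, which illustrates why adjustment for the other demographic characteristics matters.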

Prior to LexisNexis linkage, 47,271 (3.4%) patients had missing registry SSNs, with higher percentages observed among those who were younger than 60 years at the time of linkage (7.5%), Black (5.0%), API (10.3%), of unknown race (26.5%), Hispanic (7.9%), born outside the United States (7.2%), with unknown birthplace (7.2%), living in the poorest or unknown census tracts (4.9%), and alive (5.8%) (Table 2). Of the patients with missing registry SSNs, 26,895 (56.9%) had SSNs returned from LexisNexis. Using Match*Pro, 19,498 (72.5%) of these were determined to be true matches without manual review, and 5,421 (20.2%) were confirmed to be true matches through manual review. Match status could not be verified for 1,976 (7.3%) patient records.

Table 2.

Characteristics of Patients with Missing Registry Social Security Number (SSN) Prior to or Post LexisNexis Linkage, and Registry SSN Improvement after LexisNexis Linkage

| Demographic characteristics | Patients submitted for LexisNexis linkage, n (%) | Missing registry SSN prior to linkage, n | % | Missing registry SSN post linkage, n | % | Absolute reduction (%) | Relative reduction (%) |
|---|---|---|---|---|---|---|---|
| Total | 1,396,078 (100) | 47,271 | 3.4 | 23,294 | 1.7 | 1.7 | 50.7 |
| Sex a | | | | | | | |
| Male | 665,376 (47.7) | 22,409 | 3.4 | 10,342 | 1.6 | 1.8 | 54.0 |
| Female | 730,538 (52.3) | 24,846 | 3.4 | 12,946 | 1.8 | 1.6 | 47.9 |
| Age at LexisNexis linkage (y) | | | | | | | |
| <60 | 241,203 (17.3) | 18,185 | 7.5 | 8,253 | 3.4 | 4.1 | 54.6 |
| 60–<70 | 292,091 (20.9) | 13,001 | 4.5 | 5,861 | 2.0 | 2.4 | 54.8 |
| 70–<80 | 355,974 (25.5) | 9,496 | 2.7 | 4,879 | 1.4 | 1.3 | 48.7 |
| 80–<90 | 283,022 (20.3) | 4,913 | 1.7 | 3,039 | 1.1 | 0.7 | 38.5 |
| ≥90 | 223,788 (16.0) | 1,676 | 0.8 | 1,262 | 0.6 | 0.2 | 25.3 |
| Race | | | | | | | |
| White | 1,119,033 (80.2) | 27,499 | 2.5 | 11,222 | 1.0 | 1.5 | 59.3 |
| Black | 197,210 (14.1) | 9,771 | 5.0 | 6,210 | 3.2 | 1.8 | 36.4 |
| American Indian/Alaska Native | 2,203 (0.2) | 45 | 2.0 | 23 | 1.0 | 1.0 | 49.0 |
| Asian and Pacific Islander | 65,559 (4.7) | 6,756 | 10.3 | 4,175 | 6.4 | 3.9 | 38.2 |
| Unknown | 12,073 (0.9) | 3,200 | 26.5 | 1,664 | 13.8 | 12.7 | 48.0 |
| Ethnicity | | | | | | | |
| Non-Hispanic | 1,266,467 (90.7) | 37,079 | 2.9 | 16,268 | 1.3 | 1.7 | 56.3 |
| Hispanic | 129,611 (9.3) | 10,192 | 7.9 | 7,026 | 5.4 | 2.4 | 31.0 |
| Birthplace | | | | | | | |
| United States | 816,561 (58.5) | 5,612 | 0.7 | 2,366 | 0.3 | 0.4 | 58.0 |
| Outside the United States | 236,176 (16.9) | 17,097 | 7.2 | 13,760 | 5.8 | 1.4 | 19.5 |
| Unknown | 343,341 (24.6) | 24,562 | 7.2 | 7,168 | 2.1 | 5.1 | 70.8 |
| Census tract poverty level (%) | | | | | | | |
| 0–<5 | 361,050 (25.9) | 8,279 | 2.3 | 2,714 | 0.8 | 1.5 | 67.2 |
| 5–<10 | 368,475 (26.4) | 10,469 | 2.8 | 4,111 | 1.1 | 1.7 | 60.6 |
| 10–<20 | 376,573 (27.0) | 14,204 | 3.8 | 7,394 | 2.0 | 1.8 | 48.0 |
| 20–100 | 286,121 (20.5) | 13,962 | 4.9 | 8,769 | 3.1 | 1.8 | 37.3 |
| Unknown | 3,859 (0.3) | 357 | 9.3 | 306 | 7.9 | 1.3 | 14.3 |
| Vital status | | | | | | | |
| Deceased | 682,217 (48.9) | 5,940 | 0.9 | 5,345 | 0.8 | 0.1 | 10.3 |
| Alive | 713,861 (51.1) | 41,331 | 5.8 | 17,949 | 2.5 | 3.3 | 56.6 |

SSN, Social Security number.

a Patients with unknown sex are not shown in the table.
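The absolute and relative reductions in the Total row of Table 2 can be reproduced from the counts with a few lines of arithmetic. The function below is a sketch with a hypothetical name; it assumes the absolute reduction is the difference in percentage missing and the relative reduction is the share of previously missing SSNs that were filled.

```python
def missingness_reduction(total, missing_before, missing_after):
    """Return (pct_before, pct_after, absolute_reduction, relative_reduction), in percent."""
    pct_before = 100 * missing_before / total
    pct_after = 100 * missing_after / total
    absolute = pct_before - pct_after  # percentage-point drop
    relative = 100 * (missing_before - missing_after) / missing_before
    return pct_before, pct_after, absolute, relative


# Total row of Table 2.
before, after, absolute, relative = missingness_reduction(1_396_078, 47_271, 23_294)
print(f"{before:.1f}% -> {after:.1f}%, absolute {absolute:.1f}, relative {relative:.1f}%")
# prints: 3.4% -> 1.7%, absolute 1.7, relative 50.7%
```

Subgroup rows may differ from this calculation by a few tenths of a percentage point, depending on rounding and on how the published relative figures were derived.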

Missing registry SSNs were updated with LexisNexis SSNs for 23,977 patient records, reducing the overall percentage of missingness to 1.7%. A larger absolute percentage reduction was observed among those who were younger than 60 years (4.1%), API (3.9%), alive (3.3%), or with unknown race (12.7%) or birthplace (5.1%) (Table 2). Returned LexisNexis SSNs for 942 individuals were thought to be Individual Taxpayer Identification Numbers rather than SSNs and therefore were not added to the registry.
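The text does not say how the tax identification numbers were recognized. One possible screen, shown here purely as an assumption, relies on the format the IRS documents for these numbers: nine digits beginning with 9, with the fourth and fifth digits in the ranges 50–65, 70–88, 90–92, or 94–99. The function name is hypothetical and the values in the usage note are made up.

```python
import re

# Fourth-and-fifth-digit ranges documented by the IRS for ITINs.
ITIN_GROUP_RANGES = ((50, 65), (70, 88), (90, 92), (94, 99))


def looks_like_itin(value: str) -> bool:
    """Return True if a nine-digit identifier has the ITIN pattern 9XX-GG-XXXX."""
    digits = re.sub(r"\D", "", value)  # drop hyphens, spaces, other non-digits
    if len(digits) != 9 or digits[0] != "9":
        return False
    group = int(digits[3:5])
    return any(low <= group <= high for low, high in ITIN_GROUP_RANGES)
```

For example, `looks_like_itin("912-70-1234")` is true and `looks_like_itin("123-45-6789")` is false. A format check of this kind only flags candidates; it does not confirm that a number was actually issued.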

For 33,057 patients who had known registry SSNs but had different SSNs returned from LexisNexis (Figure 1), source level SSNs reported to the registry were further examined. A total of 12,071 (36.5%) had at least 1 source record that reported the same SSN as LexisNexis. After review, 11,474 (95.0%) matches were confirmed, and registry SSNs were subsequently reconsolidated using the correct source-level SSNs for those patients. The 20,986 patients who did not have the same SSNs as LexisNexis reported by any registry sources will be reviewed in the future. To resolve conflicting SSNs for those patients, we might need to use another independent data source, such as hospital discharge administrative data, to help us determine which SSNs are correct.
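The source-level check described above amounts to a three-way classification, sketched below. The function and label names are hypothetical, and the sketch assumes SSNs have already been normalized to a common string format; it only routes records for review and does not decide which SSN is correct.

```python
def classify_conflict(consolidated_ssn, source_ssns, lexisnexis_ssn):
    """Route a record based on how the LexisNexis SSN compares with registry values."""
    if lexisnexis_ssn == consolidated_ssn:
        return "consistent"
    if lexisnexis_ssn in set(source_ssns):
        # At least one source record agrees with LexisNexis:
        # a candidate for reconsolidation after match review.
        return "review_for_reconsolidation"
    # No registry source agrees with LexisNexis.
    return "needs_independent_source"
```

In the study's terms, 12,071 of the 33,057 discordant records fell into the second group and the remaining 20,986 into the third.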

Discussion

The NYSCR had the opportunity to participate in the project of linking registry data with the LexisNexis database during 2019–2021 as part of the SEER program. Per the SEER linkage protocol, all cancer patients diagnosed during 2005–2016 at age 21 years or older were selected for LexisNexis linkage. Even though the primary objective of the project was to obtain residential history of cancer patients, LexisNexis also returned other demographic information including SSN for the matched patient records. Based on the results of this large-scale linkage, the current study evaluated the feasibility of using LexisNexis to improve SSNs in the NYSCR.

Our results showed that the overall LexisNexis matching rate was remarkably high. Among nearly 1.4 million cases submitted for linkage, matching records were found in LexisNexis for 98.4%. However, the match rate varied considerably by patient demographic characteristics. For example, the match rates were significantly lower for individuals who identified as Black, API, or Hispanic, or those who were born outside the United States or with an unknown race or birthplace, compared to the reference groups. These findings were consistent with previous reports. Woolpert et al7 studied the validity of LexisNexis in identifying state of residence at death using the Georgia Cancer Registry's Cancer Recurrence and Information Surveillance cohort, and they found that cohort members who were Black, API, or Hispanic had higher odds of being missed by linkage to LexisNexis compared to White and non-Hispanic members. Lower LexisNexis match rates among API and Hispanic cancer patients have also been reported by Tatalovich et al.8 The lower LexisNexis match rates observed among minority race/ethnicity groups and those born outside the United States are likely due to missing or incomplete information in the LexisNexis database for those individuals. Our study also found that similar patient demographic characteristics determined the likelihood of obtaining SSNs from LexisNexis among patients with matches.

Prior to the LexisNexis linkages, about 3.4% of patients had missing SSNs in the NYSCR. After updating SSNs using information obtained from LexisNexis, the overall percentage of SSN missingness was reduced to 1.7%. Although patients who identified as API or who had unknown race or birthplace were less likely to have LexisNexis matches or SSNs returned, a large absolute reduction of SSN missingness was still achieved for these groups because the percentages of missing SSNs were much higher prior to linkage. A larger SSN improvement was also seen for patients who were younger than 60 years at linkage or who were still alive.

The NYSCR has a history of using LexisNexis for data quality improvement. About a decade ago, Pradhan and Boscoe5 used LexisNexis Batch searches to obtain or verify birth date, SSN, and address for patients with missing or conflicting information in the NYSCR and found that LexisNexis was a cost-effective solution for resolving data quality issues. Since then, LexisNexis has been regularly used by NYSCR geocoding staff for obtaining and verifying patient demographic information. Recently, the Michigan State Cancer Registry also highlighted its success in using LexisNexis linkage to improve SSN and vital status information.9 LexisNexis, however, has some known limitations. LexisNexis contains public and proprietary records of individuals, but such information is usually not available for minors. Therefore, linkage with LexisNexis for pediatric cancer patients would be less helpful than it is for adult patients. Thus, the SEER–LexisNexis linkage only included cancer patients aged 21 years or older.

The current study has 2 notable strengths compared to the previous evaluations. First, this SEER-sponsored LexisNexis linkage included a much larger number of patient records, allowing us to conduct more systematic and comprehensive evaluations of LexisNexis' usefulness in improving SSNs. For example, we were able to assess SSN improvement overall, as well as by detailed patient demographic characteristics. In addition, the effects of demographic characteristics on LexisNexis match rate and SSNs returned were also thoroughly examined. Second, the match records returned from LexisNexis were reviewed and verified using Match*Pro. Through this process, we identified a small number of incorrect LexisNexis matches and excluded them from SSN updates. Some of those matches appeared to be for relatives of the patients rather than for the patients themselves. The LexisNexis database contains billions of records collected from vast and diverse data sources, and thus may contain some errors. Furthermore, as in all linkages, particularly ones at such a large scale, mismatches cannot be totally prevented. Therefore, it is necessary to conduct additional review and match verification before making any updates to a registry database.

In our evaluation, about 75% of matches returned from LexisNexis could be confirmed automatically, but the remaining 25% of matches required manual review. Two staff members were involved in match verifications using Match*Pro and it took us approximately 1 week to complete the process. However, it is worth noting that the similarity scores we set for no manual review in the current evaluation were relatively high, and we believe the number of patient records requiring manual review could be further reduced through adjusting the review criteria. In addition, we found that appropriate use of the filter function in Match*Pro could speed up the review process. Data quality improvements often require resources. Our results could provide some insights for other registries that are interested in conducting a similar evaluation.

In conclusion, our study demonstrated that LexisNexis can be a valuable resource for improving the quality of SSN information in cancer registries. However, because LexisNexis occasionally returns incorrect patient matches, additional review and verification of LexisNexis matches are recommended to avoid updating registry SSNs with results from incorrect matches. This evaluation can assist registries with decisions related to similar improvement efforts.

Acknowledgments

We would like to thank Zoran Ilic for helping with the manual review.

Footnotes

This project was funded in part by the Centers for Disease Control and Prevention's (CDC) National Program of Cancer Registries through cooperative agreement 6NU58DP006309 awarded to the New York State Department of Health and by the National Cancer Institute, National Institutes of Health (NIH), Department of Health and Human Services, under Contract 75N91018D00005. The contents are solely the responsibility of the New York State Department of Health and do not necessarily represent the official views of the CDC or NIH.

References

  • 1. Thornton ML, ed. Standards for Cancer Registries Volume II: Data Standards and Data Dictionary, Version 24. North American Association of Central Cancer Registries; 2023.
  • 2. Jacobs EJ, Briggs PJ, Deka A, et al. Follow-up of a large prospective cohort in the United States using linkage with multiple state cancer registries. Am J Epidemiol. 2017;186(7):876–884. doi: 10.1093/aje/kwx129
  • 3. Nadpara PA, Madhavan SS. Linking Medicare, Medicaid, and cancer registry data to study the burden of cancers in West Virginia. Medicare Medicaid Res Rev. 2012;2(4):E1–E24. doi: 10.5600/mmrr.002.04.a01
  • 4. LexID linking technology. LexisNexis Risk Solutions website. Accessed August 29, 2023. https://risk.lexisnexis.com/our-technology/lexid
  • 5. Pradhan E, Boscoe FP. Evaluation of LexisNexis Batch Solutions for cancer registries in New York state. Presented at: The North American Association of Central Cancer Registries Annual Meeting; June 2014.
  • 6. Match*Pro software. Surveillance, Epidemiology, and End Results Program website. https://seer.cancer.gov/tools/matchpro
  • 7. Woolpert KM, Ward KC, England CV, Lash TL. Validation of LexisNexis Accurint in the Georgia Cancer Registry's cancer recurrence and information surveillance program. Epidemiology. 2021;32(3):434–438. doi: 10.1097/EDE.0000000000001327
  • 8. Tatalovich Z, Stinchcomb DG, Mariotto A, Ng D, Stevens JL, Coyle LM, Penberthy L. Assessment of interstate residential mobility of SEER patients: SEER and LexisNexis residential address linkage. J Reg Manag. 2022;49(4):109–113.
  • 9. Michigan Cancer Surveillance Program; Alverson G, DeMint T. LexisNexis linkage to improve social security number and vital status information. In: 2021. National Program of Cancer Registries Success Stories. Accessed September 22, 2023. http://www.cancerregistryeducation.org/Files/Org/f3f3d382a7a242549a9999654105a63b/site/2021%20CDC-NPCR%20Posters%20(24wx36h)%20for%20Printer.pdf

Articles from Journal of Registry Management are provided here courtesy of National Cancer Registrars Association
