Journal of the National Cancer Institute. Monographs. 2020 May 15;2020(55):89–95. doi: 10.1093/jncimonographs/lgz035

Development and Evaluation of a Process to Link Cancer Patients in the SEER Registries to National Medicaid Enrollment Data

Joan L Warren 1, Suzie Benner 2, Jennifer Stevens 3, Lindsey Enewold 1, Bin Huang 4, Lirong Zhao 5, Negussie Tilahun 1, Cathy J Bradley 6
PMCID: PMC7868030  PMID: 32412075

Abstract

Cancer patients receiving Medicaid have a worse prognosis. Patients in 14 Surveillance, Epidemiology, and End Results (SEER) cancer registries were linked to national Medicaid enrollment files, 2006–2013, to determine enrollment status during the year before and after diagnosis. A deterministic algorithm based on Social Security number, Medicare Health Insurance Claim number, sex, and date of birth was utilized. Results were compared with an independent linkage of Kentucky-based SEER and Medicaid data. A total of 559 484 cancer cases were linked to national Medicaid enrollment files, representing 15–17% of persons with cancer yearly. About 60% of these cases were a complete match on all variables. There was 99% agreement on enrollment status compared with the Kentucky linked data. SEER data were successfully linked to national Medicaid enrollment data. NCI will make the linked data available to researchers, allowing for more detailed assessments of the impact Medicaid enrollment has on cancer diagnosis and outcomes.


Medicaid is a federal-state program to make health-care coverage available for low-income residents of a state who otherwise would likely be uninsured. The program is offered to children, families, pregnant women, the aged, and the disabled, although the specific eligibility criteria and services covered by Medicaid are determined by federal and state criteria. Each state differs by eligibility threshold and covered medical services. Medicaid status is increasingly important as researchers and policy makers attempt to assess the value of Medicaid coverage to low-income and otherwise uninsured patients.

Studies have demonstrated that cancer patients receiving Medicaid are diagnosed at a more advanced stage, receive less guideline care, and have poorer survival (1–8). Prior studies have linked cancer registry data to Medicaid data to create data resources to assess the impact of Medicaid eligibility on cancer diagnosis and treatment (8–12). These projects have been limited to single states. Linkage of multiple cancer registries to Medicaid enrollment files for the entire United States has never occurred. The development of a reliable and comprehensive approach to link persons in cancer registries to national Medicaid enrollment would facilitate the creation of a data resource that researchers could use to assess the impact of Medicaid eligibility on diagnosis and treatment for cancer patients.

This article reports on a linkage of patients included in the National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) cancer registries to Medicaid enrollment data from all 50 states and the District of Columbia. We describe the linkage algorithm used to match patients in the two data sources and the process we used to quantify the degree of certainty that the correct patients were linked. We assessed the quality of the linkage by comparing the patients identified as enrolled in Medicaid via our algorithm to patients identified as Medicaid enrollees in a separate, independent linkage of patients in the Kentucky Cancer Registry (KCR) with the Kentucky Medicaid enrollment records.

As a result of this project, NCI will create a much-needed data resource: a file that includes all SEER patients found to be enrolled in any Medicaid program within a 25-month window of time around the date of cancer diagnosis (12 months before, month of, and 12 months after). The file will include monthly flags indicating whether the person was eligible for Medicaid and the reason for entitlement. Monthly Medicaid eligibility will be reported for up to three states, accounting for changes in residence for some Medicaid patients with cancer. This file will be available to researchers for approved studies.
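The 25-month window described above can be enumerated with simple month arithmetic. The following is a minimal sketch under stated assumptions, not NCI's implementation: the function name `enrollment_window` and the `(year, month)` tuple representation are hypothetical and chosen only for illustration.

```python
def enrollment_window(dx_year, dx_month, months_before=12, months_after=12):
    """Return (year, month) pairs covering the months before diagnosis,
    the diagnosis month, and the months after (25 months by default).

    The name and the tuple layout are illustrative assumptions.
    """
    if not 1 <= dx_month <= 12:
        raise ValueError("dx_month must be in 1..12")
    # Count months from year 0 so offsets become plain integer arithmetic.
    dx_index = dx_year * 12 + (dx_month - 1)
    window = []
    for index in range(dx_index - months_before, dx_index + months_after + 1):
        window.append((index // 12, index % 12 + 1))
    return window
```

For a diagnosis in March 2010, the sketch yields March 2009 through March 2011, so a monthly eligibility flag could be attached to each of the 25 entries.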

Methods

Data Sources

The primary sources were data from 14 SEER cancer registries for persons diagnosed with cancer, 2006–2013, and enrollment data from Medicaid’s Personal Summary (PS) file for each state from 2006 to 2013. We utilized personal information for SEER patients augmented with information from the SEER-Medicare linkage (13). Data from the KCR linked to Kentucky Medicaid enrollment data (KCR-KM) were used to assess the quality of the SEER-Medicaid linkage.

SEER Cancer Registry Data

The SEER registries, funded by NCI, collect information for all persons with incident cancer occurring within defined geographic areas. The registries available for our analysis included those in California (Northern California, Greater California, and Los Angeles), Connecticut, Georgia, Hawaii, Iowa, Kentucky, Louisiana, New Jersey, New Mexico, and Utah, and metropolitan Detroit and Seattle. These registries include 30% of the US population (14). New patients in the SEER registries are assigned a unique random case number. Registries collect data about each patient’s demographics, primary tumor site, histology and stage at diagnosis, initial cancer treatment, number of prior cancers, and follow-up for vital status.

The registries have legal authority to obtain personal information for each patient in their data. The personal information collected includes a Social Security number (SSN), sex, and date of birth. This information is used to track patients over time and consolidate multiple reports for the same patient. Personal information is not released by the registries to the public but is provided to NCI to permit the linkage of the SEER data to Medicare files following approval by each registry’s Institutional Review Board (IRB). NCI obtained permission from each registry to reuse the personal identifiers provided for the SEER-Medicare linkage for the new match to Medicaid enrollment data. All personal information was destroyed after completion of the SEER-Medicare and SEER-Medicaid linkages.

State-Level Medicaid PS Files

The Medicaid Analytic eXtracts (MAX) Personal Summary (PS) files are extracted from the Medicaid Statistical Information System (MSIS) and contain information about Medicaid enrollment submitted from all 50 states and the District of Columbia to the Centers for Medicare and Medicaid Services (CMS). Each state’s file includes persons enrolled in Medicaid during a given year, along with their MSIS number, SSN, sex, date of birth, and category of Medicaid eligibility. Patient name was not available on the CMS files (15). In addition, for persons enrolled in both Medicare and Medicaid (ie, dual eligibles), the PS file also includes the person’s unique Medicare number, known as the Health Insurance Claim (HIC) number. The HICs for individuals on the PS files are obtained by linking state Medicaid enrollment files to Medicare’s enrollment data. PS data were available for this study from 2006 to 2013 for all states except Kentucky and New Mexico, which had data through 2012. For this project, we attempted to link all persons in the SEER data to Medicaid enrollment information for all 50 states and the District of Columbia. To provide a clearer description of the linkages and the results, we report findings using Medicaid information only from the SEER state that reported the patient’s cancer.

SEER-Medicare Patient Identifiers and Crosswalk File

Persons in the SEER data are linked to Medicare enrollment data biennially using a deterministic match based on personal information (SSN, sex, date of birth) included in both data sources. The match results in a linkage of more than 93% of persons aged 65 years or older in the SEER data to their Medicare data, and their HIC number is obtained (16). For persons included in the SEER-Medicare data, all SEER case numbers and HIC numbers are extracted and stored in the SEER-Medicare Crosswalk file. The registries have obtained IRB approval to release to NCI those personal identifiers needed to link the SEER and Medicare data. The creation and maintenance of the SEER-Medicare Crosswalk file have been approved by each registry’s IRB and the NCI IRB.

KCR Data

The KCR has participated in the SEER program since 2000. In 2015, the KCR undertook a project to link persons with lung, female breast, colorectal, pancreatic, ovary, and prostate cancer diagnosed in 2000–2011 in the KCR files with health claims, 2000–2011, from private and public payors, including Kentucky Medicaid enrollment data provided by the Department of Medicaid Services. A probabilistic approach was used to link the KCR and Medicaid data using SSN, sex, date of birth, first name, last name, and middle name (17,18). All potential matches were manually reviewed to confirm that they were true matches. The final KCR-KM data were limited to cases diagnosed in 2007–2011 because these cases were also linked with other claims sources.

Linkage of Persons in the SEER Data to Medicaid PS File

Two files were created to link persons in the SEER data to the Medicaid PS files. The first file, the “SEER Match file,” included cancers occurring 2006–2013. For patients with multiple primary cancers, the first cancer occurring in 2006–2013 was included. For each patient, we retained their SEER case number, SSN, sex, and date of birth (Figure 1). In addition, we used the SEER case number to link to the SEER-Medicare Crosswalk file to obtain the HIC numbers for SEER patients who were also eligible for Medicare. The SEER Match file included only one record per person.

Figure 1.

Process of linking persons in the Surveillance, Epidemiology, and End Results (SEER) data diagnosed with cancer, 2006 to 2013, to persons in each State’s Medicaid Personal Summary (PS) file. HIC = Health Insurance Claim; MSIS = Medicaid Statistical Information System; SSN = Social Security number.
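The construction of the SEER Match file (one record per person, based on the first cancer diagnosed in 2006–2013, with the HIC number looked up through the Crosswalk file) can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary keys, and the representation of the Crosswalk file as a mapping from SEER case number to HIC number are all assumptions, not the authors' actual file layouts.

```python
def build_seer_match_file(cancers, crosswalk, first_year=2006, last_year=2013):
    """Keep one record per SEER case number: the first cancer diagnosed
    within the study years, plus the HIC number if the crosswalk has one.

    cancers   -- iterable of dicts with hypothetical keys
                 case_number, dx_year, dx_month, ssn, sex, dob
    crosswalk -- dict mapping SEER case number -> HIC number (assumed shape)
    """
    first = {}
    for cancer in cancers:
        if not first_year <= cancer["dx_year"] <= last_year:
            continue  # diagnosis falls outside the study period
        key = cancer["case_number"]
        dx = (cancer["dx_year"], cancer["dx_month"])
        # Replace the stored record only if this diagnosis is earlier.
        if key not in first or dx < (first[key]["dx_year"], first[key]["dx_month"]):
            first[key] = cancer
    # Retain only the identifiers used for matching.
    return [
        {
            "case_number": key,
            "ssn": rec["ssn"],
            "sex": rec["sex"],
            "dob": rec["dob"],
            "hic": crosswalk.get(key),  # None when not in SEER-Medicare
        }
        for key, rec in sorted(first.items())
    ]
```

Patients absent from the crosswalk keep a `hic` of `None`, which mirrors the fact that a HIC number is available only for SEER patients who were also linked to Medicare.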

The second file consisted of multiple “Medicaid Match files,” which were created from the PS files for each state and year (2006–2012 or 2013). For each year, persons were identified as a Medicaid beneficiary if they had at least 1 month of Medicaid enrollment during the year. Persons could appear in more than one Medicaid Match file (eg, more than 1 year and/or state); each Medicaid Match file included the state-specific MSIS identifier, SSN, HIC (if available), sex, and date of birth (Figure 1).
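The rule for building one Medicaid Match file (keep a person if they had at least 1 month of enrollment in that state and year) can be sketched as below. The record layout, including the hypothetical `months_enrolled` field, is an assumption made for illustration; the actual PS file structure is not described in this article.

```python
MATCH_FIELDS = ("msis_id", "ssn", "hic", "sex", "dob")


def build_medicaid_match_file(ps_records, min_months=1):
    """Build the match file for one state and year.

    ps_records -- iterable of dicts with the hypothetical keys in
                  MATCH_FIELDS plus months_enrolled (0-12 for that year)
    Persons with fewer than min_months of enrollment are dropped.
    """
    match_file = []
    for record in ps_records:
        if record["months_enrolled"] >= min_months:
            # Keep only the identifiers needed for matching; a missing
            # HIC (person not dually eligible) becomes None.
            match_file.append({field: record.get(field) for field in MATCH_FIELDS})
    return match_file
```

Running the function once per state and year produces the set of files in which the same person may appear more than once.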

We used a deterministic matching process. To be considered a match, a patient’s HIC or SSN, as listed in the SEER Match file, had to exactly match a record in one of the Medicaid Match files. If a patient’s HIC or SSN was not matched to any of the Medicaid Match files, they were considered not enrolled in Medicaid. The number of matched patients was tallied by SEER registry and year. The percentage of patients that matched was determined by dividing the number of matched patients by the total number of patients for each registry by year.
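The deterministic rule and the percent-matched calculation above can be illustrated with a short sketch. It assumes in-memory lists of dictionaries with hypothetical keys (`hic`, `ssn`, `registry`, `dx_year`); the function names are likewise illustrative and do not come from the authors' programs.

```python
from collections import defaultdict


def build_id_index(medicaid_match_files):
    """Collect every HIC and SSN seen in any Medicaid Match file."""
    hics, ssns = set(), set()
    for match_file in medicaid_match_files:
        for record in match_file:
            if record.get("hic"):
                hics.add(record["hic"])
            if record.get("ssn"):
                ssns.add(record["ssn"])
    return hics, ssns


def is_matched(patient, hics, ssns):
    """A patient matches if their HIC or SSN exactly equals an indexed value."""
    hic, ssn = patient.get("hic"), patient.get("ssn")
    return (bool(hic) and hic in hics) or (bool(ssn) and ssn in ssns)


def percent_matched(patients, hics, ssns):
    """Percent of patients matched, keyed by (registry, diagnosis year)."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for patient in patients:
        key = (patient["registry"], patient["dx_year"])
        totals[key] += 1
        if is_matched(patient, hics, ssns):
            matched[key] += 1
    return {key: 100.0 * matched[key] / totals[key] for key in totals}
```

Missing identifiers are treated as non-matching rather than as matching one another, so two records that both lack a HIC are never linked on that basis.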

To quantify the strength of the match, we assigned points for the agreement between the SEER and the Medicaid data on three tiers of variables: 1) HIC and SSN, 2) sex, and 3) date of birth (Figure 2). We first assessed the match on HIC and/or SSN. If there was an exact match on all digits in both the HIC and SSN, eight points were given. Four points were given if there was an exact match on the HIC but not the SSN, or if there was a match on the SSN and sex but not on the HIC. A match on sex was required with the SSN match because the SSN on the file may be that of the spouse rather than the Medicaid recipient. Unlike the SSN, the HIC, by design, includes information about whether the individual is the husband or wife. An exact match on SSN but without agreement on sex was given a lower value (three points). A match on sex was given one point. For date of birth, we used a hierarchical approach: exact match on day, month, and year (three points); match on month and year (two points); or match on year only (one point). The points were summed; the maximum possible score was 12, and the minimum was 3 (all cancer cases were required to match at least on SSN to be included). We expected that the accuracy of personal information on the Medicaid data could vary over multiple years of within-state enrollment. Therefore, we reported the patient’s highest state-specific match score in any year and applied that score to all years. For ease of reporting, we consolidated match scores into four groups: complete match (score: 12), strong match (score: 10–11), moderate match (score: 7–9), and weak match (score: 6 or less). These scores are reported by cancer registry.

Figure 2.

Scoring weights assigned to each variable used to calculate the strength of the match between persons in the SEER data and the Medicaid Personal Summary file. DOB = date of birth; HIC = Health Insurance Claim; SSN = Social Security number.
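The scoring rules described in the Methods can be sketched in a few lines. This sketch rests on one interpretive assumption that should be checked against Figure 2: the four points for "SSN and sex" are read here as three points for the SSN plus the one point for sex, so that points are strictly additive across tiers. The function names and the string labels for date-of-birth agreement are hypothetical.

```python
from datetime import date

# Points for the date-of-birth tier (hierarchical: best level reached).
DOB_POINTS = {"full": 3, "month_year": 2, "year": 1, "none": 0}


def dob_agreement(dob_a, dob_b):
    """Classify agreement between two datetime.date values."""
    if dob_a == dob_b:
        return "full"
    if (dob_a.year, dob_a.month) == (dob_b.year, dob_b.month):
        return "month_year"
    if dob_a.year == dob_b.year:
        return "year"
    return "none"


def match_score(hic_match, ssn_match, sex_match, dob_level):
    """Sum points across the three tiers (assumed additive; see lead-in)."""
    if hic_match and ssn_match:
        id_points = 8
    elif hic_match:
        id_points = 4
    elif ssn_match:
        id_points = 3  # becomes 4 once the sex point below is added
    else:
        raise ValueError("a match requires agreement on HIC or SSN")
    return id_points + (1 if sex_match else 0) + DOB_POINTS[dob_level]


def match_group(score):
    """Consolidate a score into the four reporting groups."""
    if score == 12:
        return "complete"
    if score >= 10:
        return "strong"
    if score >= 7:
        return "moderate"
    return "weak"


def best_score(scores_by_year):
    """Highest state-specific score in any year, applied to all years."""
    return max(scores_by_year.values())
```

Under this reading, a record that agrees on HIC, SSN, and sex but only on month and year of birth scores 11 (strong), whereas one that agrees on SSN, sex, and full date of birth without a HIC scores 7 (moderate).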

Assessment of the Linkage

We assessed the quality of our linkage process by comparing a subset of cancer patients from the Kentucky registry included in our SEER-Medicaid linkage to patients included in the independently linked KCR-KM data, as described above. Because the KCR is part of the SEER program, the Kentucky patients in the SEER-Medicaid and the KCR-KM data had the same SEER case numbers, allowing us to directly match patients with incident diagnoses in the two data sources for the years of overlapping data, 2007–2011. The SEER-Medicaid data were limited to persons with lung, female breast, colorectal, pancreatic, ovary, and prostate cancer cases to correspond with the cancers included in the KCR-KM data. The level of agreement about Medicaid enrollment between the SEER-Medicaid and KCR-KM data was determined for the year of the cancer diagnosis and classified in four categories: both sources agreed that the person was enrolled in Kentucky Medicaid; both sources agreed that the person was not enrolled in Kentucky Medicaid; or only one source, KCR-KM or SEER-Medicaid, reported enrollment in Kentucky Medicaid. In both data sources, patients were classified as Medicaid enrolled if they were enrolled for at least 1 month of the year. In addition to determining if there was agreement between the SEER-Medicaid and the KCR-KM data as to Medicaid enrollment, we also assessed the agreement between the two sources on the number of months that the cancer patients were enrolled in Medicaid.
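The four-category comparison and the resulting percent agreement can be expressed compactly. The sketch below assumes each source has been reduced to a mapping from SEER case number to a Boolean enrollment flag for the diagnosis year; that representation and the function name are illustrative assumptions.

```python
from collections import Counter


def agreement_summary(seer_medicaid, kcr_km):
    """Classify each shared case into one of four categories and
    return (category counts, percent agreement).

    Both arguments map SEER case number -> True if the patient was
    enrolled in Kentucky Medicaid for at least 1 month of the year.
    """
    counts = Counter()
    for case in seer_medicaid.keys() & kcr_km.keys():
        in_seer, in_kcr = seer_medicaid[case], kcr_km[case]
        if in_seer and in_kcr:
            counts["both_enrolled"] += 1
        elif not in_seer and not in_kcr:
            counts["neither_enrolled"] += 1
        elif in_seer:
            counts["seer_medicaid_only"] += 1
        else:
            counts["kcr_km_only"] += 1
    total = sum(counts.values())
    if total == 0:
        return counts, None  # no overlapping cases to compare
    agree = counts["both_enrolled"] + counts["neither_enrolled"]
    return counts, 100.0 * agree / total
```

Percent agreement is the share of cases in the two concordant categories, which is how the yearly agreement figures in Table 3 relate to the category percentages reported there.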

Results

There were more than 420 000 patients with incident cancer reported each year by the SEER registries between 2006 and 2013 (Table 1). Of these patients, 15–17% were matched to Medicaid each year. The percent of cancer patients enrolled in Medicaid varied by registry, with Los Angeles and Louisiana having the highest percents, each more than 20% every year. Utah had the lowest percent of cancer patients enrolled in Medicaid at 10% or less each year. The percent of cancer patients enrolled in Medicaid in each state remained stable from 2006 to 2012 or 2013 except Connecticut, where the percent increased from 10.9% in 2006 to 21.5% in 2013, resulting from the state’s early adoption of Medicaid expansion (19).

Table 1.

Number of people with incident cancers in the 2006–2013 SEER data* and percent who matched the MAX PS file during the year following cancer diagnosis by SEER registry and year

| SEER registry | 2006, No. (%) | 2007, No. (%) | 2008, No. (%) | 2009, No. (%) | 2010, No. (%) | 2011, No. (%) | 2012, No. (%) | 2013, No. (%) |
|---|---|---|---|---|---|---|---|---|
| California: Los Angeles | 39 727 (23.9) | 41 031 (23.8) | 40 458 (23.8) | 40 621 (24.4) | 40 011 (23.2) | 39 414 (24.1) | 38 356 (23.4) | 38 531 (23.2) |
| California: Northern California | 32 375 (15.5) | 33 050 (15.5) | 33 126 (15.4) | 33 513 (16.1) | 33 532 (15.8) | 33 015 (16.2) | 32 818 (16.6) | 32 540 (16.0) |
| California: Greater California | 90 049 (15.6) | 92 350 (15.6) | 93 445 (16.5) | 93 296 (16.7) | 92 820 (16.5) | 91 948 (17.5) | 91 739 (17.6) | 92 337 (17.8) |
| Connecticut | 22 348 (10.9) | 22 047 (11.6) | 21 770 (11.7) | 21 972 (12.0) | 21 408 (17.3) | 20 928 (19.4) | 20 624 (21.0) | 20 406 (21.5) |
| Detroit | 23 554 (14.2) | 23 873 (13.3) | 23 297 (14.3) | 23 280 (15.2) | 22 884 (16.4) | 23 253 (16.6) | 22 218 (16.4) | 22 042 (15.5) |
| Georgia | 42 620 (15.6) | 44 801 (15.6) | 45 233 (16.9) | 45 700 (17.2) | 44 775 (16.5) | 46 236 (16.9) | 47 256 (17.6) | 46 773 (17.0) |
| Hawaii | 6730 (13.2) | 6837 (12.7) | 7038 (13.4) | 7030 (14.1) | 6903 (14.8) | 6834 (15.4) | 6830 (15.4) | 6996 (15.5) |
| Iowa | 17 657 (11.9) | 17 450 (12.4) | 17 464 (13.0) | 17 911 (13.1) | 17 594 (13.2) | 17 481 (13.9) | 16 983 (14.8) | 16 753 (14.3) |
| Kentucky | 26 413 (19.8) | 27 078 (19.7) | 27 359 (19.9) | 27 680 (20.1) | 28 427 (20.3) | 28 685 (20.4) | 29 481 (19.6) | 29 432 (†) |
| Louisiana | 23 506 (22.5) | 24 469 (22.8) | 24 955 (23.0) | 25 972 (23.5) | 25 869 (24.1) | 26 660 (24.9) | 26 699 (24.4) | 26 823 (23.2) |
| New Jersey | 54 328 (9.5) | 54 943 (9.3) | 55 077 (9.7) | 55 840 (9.8) | 55 438 (10.1) | 56 296 (10.1) | 56 252 (10.5) | 57 213 (10.8) |
| New Mexico | 9245 (14.1) | 9678 (14.5) | 9910 (15.4) | 9985 (18.0) | 9992 (17.8) | 9962 (17.6) | 9880 (18.7) | 9893 (†) |
| Seattle | 25 217 (10.9) | 26 744 (10.7) | 27 031 (10.9) | 27 976 (11.7) | 28 094 (12.6) | 28 840 (12.5) | 28 570 (12.3) | 29 119 (12.6) |
| Utah | 9456 (8.7) | 9763 (8.4) | 10 231 (8.7) | 10 465 (8.7) | 10 886 (9.0) | 11 228 (9.5) | 11 332 (10.0) | 11 537 (9.6) |
| Total | 423 225 (15.2) | 434 114 (15.2) | 436 394 (15.7) | 441 241 (16.2) | 438 633 (16.4) | 440 780 (17.0) | 439 038 (17.1) | 440 395 (15.2) |

*For patients with multiple primary cancers, includes the first cancer diagnosed during 2006–2013. MAX = Medicaid Analytic eXtracts; PS = Personal Summary; SEER = Surveillance, Epidemiology, and End Results.

†2013 PS file data not available.

Table 2 reports the distribution of match scores by SEER registry. In every registry except Greater California, approximately 60% of the SEER patients who were linked to Medicaid data had a complete match on all variables, with almost all remaining matches having a moderate score. For all registries except Greater California, approximately 2% or less of all matches had a weak match score. In the Greater California registry, almost no persons had a complete match on all variables. We assessed the reporting of specific variables in Greater California and found that 1% of persons matched on day of birth. The match rate on month and year of birth (not specific day) exceeded 97% for persons in the Greater California data. The data file from the Greater California registry may have had an anomaly, but we could not effectively assess the data because, per the agreement with the SEER registries, NCI destroyed the file with the personal information from each of the registries after the linkage was completed.

Table 2.

Number of cancer patients in the SEER data who matched in the Medicaid Enrollment Data and the score for the strength of the match by individual cancer registry*

Medicaid match score (12 = highest; 1 = lowest)

| SEER registry | Complete match (score 12), % | Strong match (score 10–11), % | Moderate match (score 7–9), % | Weak match (score ≤6), % | Total cases, No. |
|---|---|---|---|---|---|
| California: Los Angeles | 63.7 | 1.8 | 32.4 | 2.1 | 75 462 |
| California: Northern California | 65.5 | 1.2 | 31.4 | 2.0 | 41 976 |
| California: Greater California | 0.7 | 62.9 | 2.8 | 33.7 | 123 368 |
| Connecticut | 68.5 | 1.4 | 29.1 | 1.0 | 26 678 |
| Detroit | 59.8 | 1.2 | 37.5 | 1.5 | 28 067 |
| Georgia | 62.1 | 2.2 | 33.9 | 1.8 | 60 610 |
| Hawaii | 58.8 | 0.7 | 39.6 | 1.0 | 7901 |
| Iowa | 65.2 | 0.8 | 33.5 | 0.4 | 18 525 |
| Kentucky | 66.7 | 2.0 | 29.7 | 1.6 | 38 946 |
| Louisiana | 63.0 | 2.1 | 33.4 | 1.5 | 48 298 |
| New Jersey | 65.9 | 2.2 | 29.9 | 2.0 | 44 396 |
| New Mexico | 58.4 | 1.7 | 38.0 | 2.0 | 11 398 |
| Seattle | 64.4 | 0.6 | 34.2 | 0.9 | 26 122 |
| Utah | 57.5 | 0.6 | 40.9 | 1.1 | 7737 |

*SEER = Surveillance, Epidemiology, and End Results.

In the comparison of the SEER-Medicaid data from Kentucky with the independent KCR-KM linkage, there were approximately 14 000 KCR patients diagnosed with incident lung, breast, colorectal, pancreatic, ovary, and prostate cancers annually (Table 3). For each year, the SEER-Medicaid and the KCR-KM linkages agreed that approximately 20% of patients were eligible for Medicaid and 79% of patients were not eligible, a percent agreement in excess of 99% per year. The two data sources had approximately 97% or greater agreement on the number of months patients were eligible in each year of diagnosis.

Table 3.

Agreement between SEER-Medicaid and KCR-KM data on the percent of cancer patients who were eligible for Medicaid and months of Medicaid enrollment, 2007–2011*

| Year | Eligibility: No. | Eligibility: ineligible in both files, % | Eligibility: eligible in both files, % | Eligibility: in SEER-Medicaid data only, % | Eligibility: in KCR-KM data only, % | Eligibility: percent agreement | Months enrolled: No. | Months enrolled: percent agreement |
|---|---|---|---|---|---|---|---|---|
| 2007 | 14 130 | 79.4 | 20.4 | 0.1 | 0.1 | 99.8 | 2885 | 97.5 |
| 2008 | 14 030 | 78.5 | 21.1 | 0.1 | 0.2 | 99.6 | 2963 | 98.2 |
| 2009 | 14 024 | 79.2 | 20.5 | 0.1 | 0.2 | 99.7 | 2869 | 98.1 |
| 2010 | 14 099 | 77.9 | 21.6 | 0.1 | 0.4 | 99.5 | 3050 | 96.7 |
| 2011 | 14 200 | 78.3 | 21.0 | 0.3 | 0.4 | 99.3 | 2983 | 98.3 |

*KCR-KM = Kentucky Cancer Registry-Kentucky Medicaid; SEER = Surveillance, Epidemiology, and End Results.

Discussion

Our analysis attempted to match 3 493 820 incident cancer cases in the SEER data to national Medicaid enrollment files and identified 559 484 cases that were eligible for Medicaid. The percent of cancer patients that were matched to Medicaid in our study, 15–17% per year, is slightly higher than national Medicaid enrollment rates of 11–13% reported between 2008 and 2013 (20). The higher rates of Medicaid enrollment that we observed may be the result of newly diagnosed cancer patients enrolling in Medicaid during the peri-diagnostic period, as reported in prior studies (2,21). In our data, the state ranking by percent of cancer patients enrolled in Medicaid was consistent with national reports for those years (20).

We required that to be considered a match, the SEER data and Medicaid files must agree exactly on the patient’s HIC or SSN. Prior linkages of cancer registries to state Medicaid data have used deterministic and probabilistic approaches using identifiers such as SSN, patient name, sex, and date of birth (8–11). One study of Michigan cancer patients age 65 years and older also included a HIC number in the matching process (12). Likewise, we included the HIC number as a match variable. We believe that HIC number is a strong match variable because it is unique to each individual and is used to process Medicare claims. A HIC number was found in both the SEER Match file and the Medicaid Match files for 67.5% of cancer patients in our data. This means that most SEER cases matched to the Medicaid data were dual-eligible, reflecting the fact that cancer is a disease of the elderly. Except for the Greater California registry, more than 90% of SEER patients matched to Medicaid had a score that was a complete, strong, or moderate match. The inclusion of additional variables used to calculate match scores allows researchers to assess the quality of the match and perform sensitivity analyses, including and excluding patients with lower match scores.

This study has several important strengths. This is the first linkage of all patients in the SEER data to Medicaid enrollment information from all states. The state-based data provide a unique opportunity to compare differences in cancer stage and survival based on varying Medicaid coverage across individual states. The quality of the SEER-Medicaid linkage was confirmed by comparison with an independent linkage performed by the KCR. Finally, NCI’s creation of a file that includes Medicaid enrollment information for cancer patients eliminates the administrative and logistical challenges that researchers encounter when linking cancer registry data to Medicaid files.

There are also several limitations to our study. The data in our analysis are from 2006 to 2013, before the Affordable Care Act’s 2014 provisions for Medicaid expansion were adopted (22). The data we used were the most recent national Medicaid enrollment data available from CMS at the time of this linkage. In 2013, CMS began to transition the Medicaid data to the Transformed MSIS (T-MSIS); however, data quality challenges have delayed T-MSIS implementation (23). We believe that the approaches we developed to link persons in the SEER data to the Medicaid PS file could be useful for more current SEER-Medicaid linkages once the T-MSIS system is fully operational. NCI is committed to updating the linkage of SEER patients to more recent Medicaid enrollment information. An additional limitation is that we only evaluated the linkage of cancer patients to the Medicaid data for the state where they lived during the year of their cancer diagnosis. We did not assess the quality of the linkage for patients who moved to another state after their cancer diagnosis. Although the SEER registries include almost one-third of the US population, states where registries are not part of the SEER program are not included in the linkage. This limits the generalizability of analyses using the linked SEER-Medicaid enrollment data, especially for states that have not expanded their Medicaid coverage. Finally, the focus of our project was to develop a method to link SEER cancer patients to Medicaid enrollment information. We opted not to obtain Medicaid claims for health-care utilization because the MAX claims are complex, with variable coverage and quality by state and year. The MAX claims are to be replaced with the T-MSIS Analytic File (TAF) data. CMS announced on November 7, 2019, that the TAF research identifiable files will be made available to researchers (24). With the availability of these data, NCI will assess the possibility of obtaining Medicaid claims for SEER patients. However, these files have different data structures and contents. The quality of these claims would need to be assessed carefully, either by NCI or researchers, because prior studies have reported considerable issues in the completeness of Medicaid claims data (25–27).

In conclusion, this study developed and tested an approach to link persons in multiple SEER cancer registries to national Medicaid enrollment data to identify those cancer patients who were Medicaid enrollees. Knowing which patients are enrolled in Medicaid and the timing of their Medicaid enrollment relative to their cancer diagnosis will allow researchers to undertake studies across states related to stage and outcomes for some of the most vulnerable cancer patients. NCI will create a file for all persons in the SEER data that are matched to Medicaid in any state or multiple states. This file will be made available to researchers once the data release process has been finalized.

Notes

Affiliations of authors: Healthcare Assessment Research Branch, Healthcare Delivery Research Program, Division of Cancer Control and Population Science, National Cancer Institute, Bethesda, MD (JLW, LE); Fu Associates, Ltd, Arlington, VA (SB); Information Management Services, Calverton, MD (JS); Department of Biostatistics, College of Public Health, Markey Cancer Center, University of Kentucky, Lexington, KY (BH); Division of Data, Research, and Analytic Methods, Center for Medicare & Medicaid Innovation, Centers for Medicare and Medicaid Services, Baltimore, MD (LZ); Department of Health Systems, Management and Policy, School of Public Health, University of Colorado, Aurora, CO (CJB).

The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the US Department of Health and Human Services or any of its agencies. The authors have no disclosures.


Articles from Journal of the National Cancer Institute. Monographs are provided here courtesy of Oxford University Press
