Abstract
Purpose
Electronic health records (EHRs) comprise a rich source of real-world data for cancer studies, but they often lack critical structured data elements like diagnosis date and disease stage. Fortunately, such concepts are available from hospital cancer registries. We describe experiences from integrating cancer registry data with electronic health record and billing data in an interoperable data model across a multi-site clinical research network.
Methods
After sites implemented cancer registry data into a tumor table compatible with the PCORnet Common Data Model, distributed queries were performed to assess quality issues. After remediation of quality issues, another query produced descriptive frequencies of cancer types and demographic characteristics. This included linked body mass index. We also report two current use cases of the new resource.
Results
Eleven sites implemented the tumor table, yielding a resource with data for 572,902 tumors. Institutional and technical barriers were surmounted to accomplish this. Variation in racial and ethnic distributions across the sites was observed; the percent of tumors among Black patients ranged from less than 1% to 15% across sites, and the percent of tumors among Hispanic patients ranged from 1% to 46% across sites. Current use cases include a pragmatic prospective cohort study of a rare cancer and a retrospective cohort study leveraging body size and chemotherapy dosing.
Conclusion
Integrating cancer registry data with the PCORnet CDM across multiple institutions creates a powerful resource for cancer studies. It provides a wider array of structured, cancer-relevant concepts, and it allows investigators to examine variability in those concepts across many treatment environments. Having the CDM tumor table in place enhances the network’s effectiveness for real-world cancer research.
Introduction
Electronic health records (EHRs) comprise a rich source of real-world data for cancer studies. However, critical, structured data elements are often absent from these records, including diagnosis date, disease stage, and histology. It is essential to assess such factors for most cancer studies since they relate to treatment decisions, outcomes, and patient experiences. Although much of this information can be found in the EHR, it is often represented in clinical notes that must be abstracted for analysis.1
Hospital cancer registries offer a solution. Each institution accredited by the American College of Surgeons Commission on Cancer operates a registry of cancer patients seen at its facilities. These registries employ standards developed by the North American Association of Central Cancer Registries (NAACCR). These standards include data fields for demographics, treatments, tumor characteristics, and cancer outcomes. Specialists review EHRs to abstract such concepts into structured data sets, which are used to assess quality of care.2 In addition, these data are submitted to central cancer registries that generate cancer statistics.
Because of the complementary relationship between EHR and registry data3, the Greater Plains Collaborative (GPC), a PCORnet® clinical research network, incorporated the NAACCR data into the Common Data Model (CDM) framework used by members of PCORnet. Here, we describe the lessons learned through the interactions with registrars and informatics teams at GPC institutions. We present data to suggest the value of these resources for real-world cancer research.4–6
Methods
The PCORnet CDM comprises tables containing data for demographics, procedures, medications, diagnoses, vital status, lab results, and patient-reported outcomes.7 These tables are populated from EHRs and billing systems by PCORnet institutions. Adding NAACCR data was accomplished by developing a tumor table that is compatible with the CDM framework. This table allows the NAACCR data to be linked to the other CDM tables at the patient level.
The cancer registry data are incorporated into a tumor table that GPC sites are required to implement and annually update. The current specification of this table (Version 1.2) includes all 774 data elements that comprise NAACCR Version 18.8 The CDM patient identifier is also added to the tumor table, allowing linkage to other CDM tables. The NAACCR data provide many fields that were used to link the data to the CDM tables, including medical record number, patient name, and birth date. Per PCORnet requirements, fields with identifying data are dropped or set to null before the table is made available for use.
The CDM is commonly queried using the SAS application (SAS Institute Inc., Cary, NC), and many of the NAACCR field names were shortened to reflect the 32-character limit on SAS variable names. To prevent confusion from the shortened names, the NAACCR item number was appended to the column names. For example, NAACCR Item #400 is named “Primary Site”, and its corresponding tumor table data element is named “PRIMARY_SITE_N400”. The item number allows users to find corresponding documentation in the NAACCR data dictionary.9
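The naming convention can be illustrated with a short Python sketch. The helper name `tumor_column_name` and its truncation rule are assumptions made for illustration only; the shortened names in the tumor table specification are not reproduced by this rule, and only the “Primary Site” example comes from the text above.

```python
import re

SAS_NAME_LIMIT = 32  # maximum length of a SAS variable name


def tumor_column_name(item_name: str, item_number: int) -> str:
    """Build a column name from a NAACCR item name and item number.

    Hypothetical helper: the truncation rule below is an assumption,
    not the rule used in the tumor table specification.
    """
    suffix = f"_N{item_number}"
    # Upper-case the name and collapse runs of non-alphanumerics to "_".
    base = re.sub(r"[^A-Za-z0-9]+", "_", item_name).strip("_").upper()
    # Trim the descriptive part so that base + suffix fits the limit.
    base = base[: SAS_NAME_LIMIT - len(suffix)].rstrip("_")
    return base + suffix


print(tumor_column_name("Primary Site", 400))  # PRIMARY_SITE_N400
```

Keeping the item number as a suffix means the link to the NAACCR data dictionary survives even when the descriptive part of the name has to be shortened.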
After implementing the tumor table, sites executed a distributed SAS query to assess quality. This query generated descriptive statistics that each site shared with the project management team.8 A set of elements that are generally reliably populated in NAACCR data (diagnosis year and cancer site, in particular) were examined for missing or invalid values. Missing variables were also noted along with changes in case count over time. We also examined whether patients in the tumor table had records in the CDM encounter table.
Cancer registries generally finalize a tumor record after the first course of treatment is planned. However, an additional record can be generated if a registry finds new information about a completed case. Multiple records can also be mistakenly generated (e.g., when a cancer is identified from more than one source). Identifying tumors with multiple records was a focus of the quality evaluation; many such records would raise concerns about quality. A tumor can be uniquely identified at each facility with a patient identifier and the NAACCR sequence number for the tumor (Item #560). Registries that cover more than one hospital also need to include a field for facility to uniquely identify specific tumors.
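As an illustration of this check, the following Python sketch counts tumors that appear in more than one record. It is not the distributed SAS query used by the sites; the function name and the dictionary keys (`patid`, `sequence_number`, `facility`) are hypothetical.

```python
from collections import Counter


def multiple_record_rate(records, multi_facility=False):
    """Return the fraction of tumors that appear in more than one record.

    `records` is an iterable of dicts with hypothetical keys "patid",
    "sequence_number" (NAACCR Item #560) and, for registries covering
    several hospitals, "facility".
    """
    def key(rec):
        base = (rec["patid"], rec["sequence_number"])
        return base + (rec["facility"],) if multi_facility else base

    # One counter entry per distinct tumor; the value is its record count.
    counts = Counter(key(rec) for rec in records)
    if not counts:
        return 0.0
    duplicated = sum(1 for n in counts.values() if n > 1)
    return duplicated / len(counts)


example = [
    {"patid": "P1", "sequence_number": "00"},
    {"patid": "P1", "sequence_number": "00"},  # second record, same tumor
    {"patid": "P2", "sequence_number": "01"},
    {"patid": "P2", "sequence_number": "02"},  # a different tumor
]
print(multiple_record_rate(example))
```

In this toy example one of three distinct tumors has more than one record. Setting `multi_facility=True` adds the facility to the key, as described for registries that cover more than one hospital.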
After remediation of quality issues, another distributed query evaluated the usefulness of the linkage by generating descriptive frequencies of cancer types and demographic characteristics. This included body mass index, an important variable for cancer research that is not available in either cancer registry or claims data.
We also report two use cases of the new resource. One is a pragmatic prospective cohort study of a rare cancer, and the other is a retrospective cohort study examining body size and chemotherapy dosing.
Results
Securing Data Access
Institutional challenges presented some of the greatest difficulties to overcome. Cancer centers had not generally provided registry data to other groups; developing a trust relationship and establishing data sharing procedures took time. Even once the agreement to share the data was obtained, there was occasionally an initial preference to treat the request as a one-time data extract. Similarly, there was sometimes a “minimum data necessary” tradition that had to be addressed before cancer centers agreed to provide all NAACCR fields. It helped to clarify that the data would be used to create a general-purpose data resource rather than used for a specific research project.
Securing the support of cancer center leadership for these efforts was important. It was especially helpful to enlist cancer epidemiology researchers at these institutions to convince leadership of the benefits of this resource; those investigators are often familiar with cancer surveillance data and could knowledgeably advocate for its inclusion.
Populating the Tumor Table
The tumor table is substantially larger (hundreds of data elements) than other CDM tables (the most for any of the other tables is 60 elements). This presented technical challenges for some GPC sites (see below). However, there were notable benefits to including the entire set of data elements. Namely, it allowed use of standard NAACCR file outputs and existing registry processes. Institutional cancer registries have procedures in place to share complete sets of data with public health authorities (e.g., a state-based central registry), and abstracting software generates NAACCR-compliant files for this purpose. To take advantage of these existing data-sharing processes, the tumor table was designed to incorporate values for all NAACCR fields. Using these standardized output files also allowed sites to learn from each other and share code that could be adapted across the GPC. This saved hundreds of hours of programming time across the sites. It is important to note, however, that sites were not required to use existing data extraction/loading processes; they were only required to follow the tumor table specification.
Other challenges were institution specific. For example, some cancer registries cover multiple health care facilities, and this required a greater investment of resources to appropriately map the registry data to the table specification. Other sites had to work with external contractors that support their registry’s abstracting software, and this sometimes involved contractual issues.
Quality Review
Of the eleven sites that successfully implemented the tumor table, seven initially had problems with incomplete data. These problems were often revealed by low counts or abrupt changes in counts across time. In other cases, values for specific variables were missing. In these situations, sites successfully resolved the issues by working with their respective registries.
We also examined tumor table records that did not have corresponding patient entries in the CDM encounter table. Each patient represented in the tumor table was expected to have an entry in the encounter table, although a small number of registry-only cases would not be a cause for concern. In fact, a complete lack of such cases would be surprising. We found five sites initially had a complete lack of registry-only cases. Follow-up revealed that the registry-only cases were mistakenly removed as part of a particularly rigorous approach to data cleaning at four of those sites. Ultimately, these cases were restored to the tumor table. For a fifth site, the follow-up revealed several idiosyncratic problems that were addressed. Following remediation, the percentage of registry-only cases for most sites ranged from less than 0.1% to 3.1%. For one site, the percentage of registry-only cases was almost 17%, but that was attributed to its inclusion of older registry cases who were diagnosed before the period covered by the other CDM tables.
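The registry-only check amounts to a set difference between the patient identifiers in the two tables. The sketch below is a minimal Python illustration, assuming the identifiers have already been extracted into two collections; the function name is hypothetical and this is not the query the sites ran.

```python
def registry_only_percent(tumor_patids, encounter_patids):
    """Percent of tumor-table patients with no encounter-table record."""
    tumor_patients = set(tumor_patids)
    if not tumor_patients:
        return 0.0
    # Patients known to the registry but absent from the encounter table.
    registry_only = tumor_patients - set(encounter_patids)
    return 100.0 * len(registry_only) / len(tumor_patients)


print(registry_only_percent(["a", "b", "c", "d"], ["a", "b", "c"]))  # 25.0
```

A value of exactly zero, or a very high value, would each prompt follow-up with the site, for the reasons described in this section.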
There was one GPC site whose registry data were not successfully linked to their CDM, and those data are not included among the eleven sites represented in this report. The linkage problem was revealed by an exceptionally high proportion of registry-only cases: over 70% of the patients in the tumor table lacked an entry in the encounter table.
Four sites initially had a substantial number of tumors with multiple records. For three of these sites, the percentage of tumors with multiple records ranged from 15% to 18%. For the fourth site, over 90% of the tumors had more than one record. These sites worked with their registries to address the issue, and ultimately ten of the eleven sites reported no more than 0.2% of their tumors with multiple records. The percentage of tumors with multiple records for the remaining site was 1.7%.
Usefulness of the Linkage
Table 1 describes data for the 572,902 tumors from all GPC sites that have implemented the tumor table. This includes data for rare cancers (e.g., 17,787 pancreatic tumors and 6,051 pediatric tumors). Differences in racial and ethnic distributions across the sites were observed; the percent of tumors among Black patients ranged from less than 1% to 15% across sites, and the percent of tumors among Hispanic patients ranged from 1% to 46%.
Table 1.
Descriptive characteristics for current tumor data from the eleven Greater Plains Collaborative sites that have successfully implemented the tumor table.
| | Greater Plains Collaborative Site | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Site A | Site B | Site C | Site D | Site E | Site F | Site G | Site H | Site I | Site J | Site K |
| Total N 1 | 67,931 | 34,757 | 68,154 | 56,006 | 75,762 | 23,219 | 34,193 | 21,892 | 84,039 | 55,572 | 51,377 |
| Age at Diagnosis | |||||||||||
| <18 years | 250 (0%) | 270 (1%) | 570 (1%) | 1,170 (2%) | 590 (1%) | 260 (1%) | 430 (1%) | 450 (2%) | 930 (1%) | 730 (1%) | 390 (1%) |
| 18–30 years | 1,250 (2%) | 460 (1%) | 1,810 (3%) | 1,710 (3%) | 2,040 (3%) | 720 (3%) | 1,040 (3%) | 730 (3%) | 2,760 (3%) | 2,470 (4%) | 1,190 (2%) |
| 31–45 years | 5,730 (8%) | 1,890 (5%) | 6,010 (9%) | 5,110 (9%) | 6,950 (9%) | 2,160 (9%) | 3,310 (10%) | 2,660 (12%) | 9,590 (11%) | 6,790 (12%) | 4,110 (8%) |
| 46–60 years | 18,950 (28%) | 7,830 (23%) | 19,230 (28%) | 16,370 (29%) | 21,910 (29%) | 6,920 (30%) | 9,080 (27%) | 7,390 (34%) | 26,130 (31%) | 15,290 (28%) | 14,190 (28%) |
| 61–75 years | 27,820 (41%) | 15,270 (44%) | 28,710 (42%) | 22,660 (40%) | 32,520 (43%) | 9,520 (41%) | 15,080 (44%) | 8,230 (38%) | 34,530 (41%) | 22,530 (41%) | 23,340 (45%) |
| 76+ years | 13,940 (21%) | 9,040 (26%) | 11,820 (17%) | 8,970 (16%) | 11,570 (15%) | 3,630 (16%) | 5,270 (15%) | 2,420 (11%) | 10,100 (12%) | 7,770 (14%) | 8,150 (16%) |
| Not available3 | 0 (0%) | 10 (0%) | 0 (0%) | 0 (0%) | 170 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Body Mass Index | |||||||||||
| <19 | 1,180 (2%) | 670 (2%) | 1,430 (2%) | 1,750 (3%) | 1,710 (2%) | 820 (4%) | 750 (2%) | 630 (3%) | 2,200 (3%) | 1,150 (2%) | 1,350 (3%) |
| 19–24 | 14,690 (22%) | 6,310 (18%) | 14,800 (22%) | 10,940 (20%) | 16,440 (22%) | 4,950 (21%) | 6,980 (20%) | 4,480 (20%) | 20,620 (25%) | 12,530 (23%) | 11,030 (21%) |
| 25–29 | 20,630 (30%) | 10,860 (31%) | 21,050 (31%) | 15,410 (28%) | 23,090 (30%) | 6,620 (29%) | 10,080 (29%) | 6,520 (30%) | 28,040 (33%) | 16,950 (31%) | 15,470 (30%) |
| 30–35 | 16,240 (24%) | 9,480 (27%) | 15,910 (23%) | 12,680 (23%) | 18,080 (24%) | 5,380 (23%) | 8,480 (25%) | 5,050 (23%) | 18,830 (22%) | 11,920 (21%) | 12,470 (24%) |
| 36+ | 10,060 (15%) | 6,100 (18%) | 9,750 (14%) | 9,700 (17%) | 11,600 (15%) | 4,180 (18%) | 5,390 (16%) | 3,170 (14%) | 9,940 (12%) | 6,980 (13%) | 8,190 (16%) |
| Not available3 | 5,140 (8%) | 1,340 (4%) | 5,210 (8%) | 5,530 (10%) | 4,850 (6%) | 1,280 (6%) | 2,520 (7%) | 2,040 (9%) | 4,420 (5%) | 6,040 (11%) | 2,860 (6%) |
| Class of Case 2 | |||||||||||
| Analytic | 67,840 (100%) | 32,910 (95%) | 50,320 (74%) | 47,260 (84%) | 68,470 (90%) | 20,840 (90%) | 26,130 (76%) | 12,400 (57%) | 60,880 (72%) | 54,700 (98%) | 45,240 (88%) |
| Non-analytic | 90 (0%) | 1,500 (4%) | 17,840 (26%) | 8,700 (16%) | 6,740 (9%) | 2,360 (10%) | 8,060 (24%) | 9,490 (43%) | 23,160 (28%) | 870 (2%) | 6,140 (12%) |
| Not available3 | 0 (0%) | 350 (1%) | 0 (0%) | 40 (0%) | 560 (1%) | 20 (0%) | 10 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Race | |||||||||||
| American Indian/Alaska Native | 300 (0%) | 400 (1%) | 220 (0%) | 140 (0%) | 220 (0%) | 40 (0%) | 150 (0%) | 30 (0%) | 240 (0%) | 600 (1%) | 90 (0%) |
| Asian/Pacific Islander | 1,050 (2%) | 280 (1%) | 870 (1%) | 600 (1%) | 1,140 (1%) | 190 (1%) | 420 (1%) | 430 (2%) | 3,930 (5%) | 1,150 (2%) | 680 (1%) |
| Black | 2,120 (3%) | 80 (0%) | 6,670 (10%) | 1,650 (3%) | 5,980 (8%) | 1,190 (5%) | 1,670 (5%) | 1,250 (6%) | 12,620 (15%) | 510 (1%) | 7,600 (15%) |
| White | 63,670 (94%) | 33,590 (97%) | 59,520 (87%) | 52,880 (94%) | 66,230 (87%) | 21,530 (93%) | 31,610 (92%) | 19,560 (89%) | 65,770 (78%) | 52,720 (95%) | 42,790 (83%) |
| Other | 30 (0%) | 10 (0%) | 530 (1%) | 190 (0%) | 1,830 (2%) | 160 (1%) | 140 (0%) | 100 (0%) | 310 (0%) | 240 (0%) | 180 (0%) |
| Not available3 | 770 (1%) | 410 (1%) | 340 (1%) | 550 (1%) | 370 (0%) | 120 (1%) | 210 (1%) | 530 (2%) | 1,150 (1%) | 360 (1%) | 50 (0%) |
| Sex | |||||||||||
| Female | 40,220 (59%) | 17,660 (51%) | 34,310 (50%) | 27,910 (50%) | 39,620 (52%) | 11,750 (51%) | 16,640 (49%) | 11,580 (53%) | 39,440 (47%) | 27,010 (49%) | 26,210 (51%) |
| Male | 27,700 (41%) | 17,090 (49%) | 33,820 (50%) | 28,070 (50%) | 35,990 (48%) | 11,470 (49%) | 17,540 (51%) | 10,310 (47%) | 44,590 (53%) | 28,540 (51%) | 25,130 (49%) |
| Other | 10 (0%) | 10 (0%) | 20 (0%) | 20 (0%) | 30 (0%) | 10 (0%) | 10 (0%) | 0 (0%) | 10 (0%) | 20 (0%) | 20 (0%) |
| Not available3 | 0 (0%) | 0 (0%) | 0 (0%) | 10 (0%) | 120 (0%) | 0 (0%) | 10 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 20 (0%) |
| Hispanic Ethnicity | |||||||||||
| Hispanic | 850 (1%) | 300 (1%) | 1,710 (3%) | 1,080 (2%) | 2,200 (3%) | 270 (1%) | 1,160 (3%) | 9,990 (46%) | 10,370 (12%) | 3,100 (6%) | 320 (1%) |
| Non-Hispanic | 66,250 (98%) | 32,250 (93%) | 66,020 (97%) | 54,100 (97%) | 71,510 (94%) | 22,780 (98%) | 32,690 (96%) | 11,180 (51%) | 72,590 (86%) | 51,520 (93%) | 50,860 (99%) |
| Not available3 | 830 (1%) | 2,210 (6%) | 420 (1%) | 820 (1%) | 2,050 (3%) | 170 (1%) | 340 (1%) | 720 (3%) | 1,090 (1%) | 950 (2%) | 200 (0%) |
| Year of Diagnosis | |||||||||||
| 2010 | 3,310 (5%) | 3,210 (9%) | 4,290 (6%) | 3,830 (7%) | 3,080 (4%) | 1,470 (6%) | NA | 1,500 (7%) | 4,830 (6%) | 2,370 (4%) | 200 (0%) |
| 2011 | 3,750 (6%) | 3,210 (9%) | 4,640 (7%) | 4,050 (7%) | 3,580 (5%) | 1,550 (7%) | NA | 1,950 (9%) | 5,150 (6%) | 3,180 (6%) | 250 (0%) |
| 2012 | 4,520 (7%) | 3,010 (9%) | 4,740 (7%) | 4,200 (8%) | 5,360 (7%) | 1,710 (7%) | NA | 1,870 (9%) | 5,540 (7%) | 3,420 (6%) | 320 (1%) |
| 2013 | 5,250 (8%) | 2,910 (8%) | 4,900 (7%) | 4,380 (8%) | 5,840 (8%) | 1,710 (7%) | NA | 1,990 (9%) | 5,650 (7%) | 3,580 (6%) | 410 (1%) |
| 2014 | 5,350 (8%) | 2,780 (8%) | 5,040 (7%) | 4,320 (8%) | 6,130 (8%) | 1,960 (8%) | 2,090 (6%) | 1,980 (9%) | 5,920 (7%) | 3,910 (7%) | 540 (1%) |
| 2015 | 5,480 (8%) | 2,590 (7%) | 5,270 (8%) | 4,580 (8%) | 6,360 (8%) | 1,940 (8%) | 3,510 (10%) | 2,100 (10%) | 6,390 (8%) | 4,340 (8%) | 810 (2%) |
| 2016 | 5,500 (8%) | 2,490 (7%) | 5,290 (8%) | 4,750 (8%) | 6,490 (9%) | 1,990 (9%) | 3,590 (11%) | 1,990 (9%) | 6,720 (8%) | 4,460 (8%) | 1,340 (3%) |
| 2017 | 5,590 (8%) | 2,540 (7%) | 5,600 (8%) | 4,930 (9%) | 6,550 (9%) | 2,300 (10%) | 3,840 (11%) | 1,840 (8%) | 7,460 (9%) | 4,710 (8%) | 6,860 (13%) |
| 2018 | 5,680 (8%) | 2,400 (7%) | 5,750 (8%) | 5,150 (9%) | 6,860 (9%) | 2,330 (10%) | 4,000 (12%) | 1,580 (7%) | 7,880 (9%) | 3,830 (7%) | 9,360 (18%) |
| 2019 | 5,920 (9%) | 2,400 (7%) | 5,950 (9%) | 5,240 (9%) | 7,020 (9%) | 2,510 (11%) | 4,020 (12%) | 1,390 (6%) | 8,170 (10%) | 4,120 (7%) | 9,220 (18%) |
| 2020 | 5,150 (8%) | 2,260 (6%) | 5,020 (7%) | 4,370 (8%) | 6,360 (8%) | 2,230 (10%) | 3,690 (11%) | 1,200 (5%) | 7,250 (9%) | 4,050 (7%) | 8,650 (17%) |
| 2021 | 6,220 (9%) | 2,460 (7%) | 5,250 (8%) | 3,120 (6%) | 6,600 (9%) | 1,520 (7%) | 4,050 (12%) | 1,390 (6%) | 6,880 (8%) | 5,240 (9%) | 8,740 (17%) |
| 2022 | 6,200 (9%) | 2,110 (6%) | 5,360 (8%) | 2,860 (5%) | 5,540 (7%) | 0 (0%) | 4,060 (12%) | 1,040 (5%) | 5,380 (6%) | 5,530 (10%) | 4,240 (8%) |
| 2023 | 0 (0%) | 410 (1%) | 1,050 (2%) | 220 (0%) | 10 (0%) | 0 (0%) | 1,350 (4%) | 70 (0%) | 840 (1%) | 2,820 (5%) | 440 (1%) |
| Not available3 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Tumor Site | |||||||||||
| Breast | 15,380 (23%) | 6,050 (17%) | 9,140 (13%) | 3,790 (7%) | 13,630 (18%) | 2,800 (12%) | 4,110 (12%) | 4,470 (20%) | 11,390 (14%) | 7,000 (13%) | 7,260 (14%) |
| Colon/Rectum | 4,630 (7%) | 2,330 (7%) | 3,970 (6%) | 2,690 (5%) | 4,260 (6%) | 1,400 (6%) | 2,270 (7%) | 1,610 (7%) | 3,420 (4%) | 2,580 (5%) | 3,330 (6%) |
| Kidney/Renal pelvis | 2,590 (4%) | 1,000 (3%) | 2,360 (3%) | 2,330 (4%) | 3,230 (4%) | 1,180 (5%) | 1,340 (4%) | 820 (4%) | 5,160 (6%) | 1,920 (3%) | 2,230 (4%) |
| Lung | 6,810 (10%) | 3,930 (11%) | 5,770 (8%) | 4,890 (9%) | 6,700 (9%) | 2,960 (13%) | 3,430 (10%) | 1,790 (8%) | 6,410 (8%) | 2,850 (5%) | 5,870 (11%) |
| Ovary | 840 (1%) | 450 (1%) | 920 (1%) | 1,070 (2%) | 970 (1%) | 290 (1%) | 310 (1%) | 180 (1%) | 610 (1%) | 740 (1%) | 620 (1%) |
| Pancreas | 1,590 (2%) | 860 (2%) | 3,260 (5%) | 2,080 (4%) | 2,190 (3%) | 560 (2%) | 1,320 (4%) | 540 (2%) | 2,010 (2%) | 1,480 (3%) | 1,890 (4%) |
| Prostate | 5,350 (8%) | 4,550 (13%) | 6,630 (10%) | 3,510 (6%) | 6,010 (8%) | 1,540 (7%) | 3,120 (9%) | 2,110 (10%) | 11,470 (14%) | 5,330 (10%) | 5,010 (10%) |
| Uterus | 3,610 (5%) | 1,440 (4%) | 2,080 (3%) | 3,420 (6%) | 2,160 (3%) | 750 (3%) | 780 (2%) | 840 (4%) | 1,440 (2%) | 1,480 (3%) | 1,870 (4%) |
| Other | 27,130 (40%) | 14,140 (41%) | 34,030 (50%) | 32,230 (58%) | 36,530 (48%) | 11,730 (51%) | 17,530 (51%) | 9,530 (44%) | 42,140 (50%) | 32,200 (58%) | 23,300 (45%) |
| Not available3 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 100 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
1. To avoid reporting data that could allow identification of specific patients, all counts except Total N are rounded to the nearest 10 and all percents are rounded to the nearest 1.
2. NAACCR Item #610. Analytic cases are those who were either diagnosed or treated at the reporting facility.
3. “Not available” items were either coded as unknown, had invalid values, or were missing.
The “class of case” variable (NAACCR Item #610) indicates whether the patient was diagnosed or treated at the reporting hospital. Patients who were diagnosed or had at least part of their first course of therapy at the facility are defined as “analytic”. This contrasts with patients seen at the facility for a second opinion or another consultation. Researchers who use NAACCR data often restrict study cohorts to analytic cases since those records are considered complete and accurate. Table 1 shows that the proportion of analytic tumors ranges across sites from 57% to almost 100%. This may indicate that registries vary in the extent to which they prioritize abstracting non-analytic cases.
Linking registry and other CDM tables allows the assessment of more cancer-relevant concepts than either source by itself. To illustrate this, the distribution of body mass index (BMI) values is included in Table 1. These values were computed using height and weight measures found in the CDM vital table. (If multiple weights were available, the measure taken most closely in time to the tumor diagnosis was used. For adults, the modal height measure was used; for children, the height measure taken most closely in time to the selected weight measure was used.) At least one BMI measurement was more likely to be available for analytic cases; across sites, an average of 94% of analytic records could be matched to a BMI measurement compared to 89% of non-analytic records. We also observed that the median number of days between the diagnosis date recorded in the tumor table and the closest available BMI measurement was much shorter for analytic cases (data not shown).
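The selection rules for height and weight can be sketched in Python as follows. This is a minimal illustration rather than the query used to produce Table 1: the function name is hypothetical, the inputs are assumed to be in kilograms and metres (units stored in the CDM vital table would need converting first), and the tie-breaking rule for the modal height is an assumption.

```python
from datetime import date
from statistics import multimode


def bmi_near_diagnosis(dx_date, weights, heights, is_adult):
    """Return BMI (kg/m^2) using the selection rules described above.

    `weights` and `heights` are lists of (measure_date, value) pairs,
    assumed here to be in kilograms and metres. Returns None when
    either list is empty.
    """
    if not weights or not heights:
        return None
    # Weight taken closest in time to the tumor diagnosis.
    wt_date, wt = min(weights, key=lambda m: abs((m[0] - dx_date).days))
    if is_adult:
        # Modal height; ties are broken by taking the first mode
        # encountered (an assumption).
        ht = multimode(h for _, h in heights)[0]
    else:
        # Height taken closest in time to the selected weight.
        _, ht = min(heights, key=lambda m: abs((m[0] - wt_date).days))
    return wt / ht ** 2


dx = date(2020, 1, 15)
weights = [(date(2019, 6, 1), 70.0), (date(2020, 1, 10), 80.0)]
heights = [(date(2018, 1, 1), 1.80), (date(2019, 1, 1), 1.80), (date(2020, 1, 1), 1.79)]
print(round(bmi_near_diagnosis(dx, weights, heights, is_adult=True), 1))
```

Using the modal height for adults guards against a single mis-keyed measurement, whereas children's heights change over time, so the measure nearest the selected weight is the more sensible choice.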
These results illustrate the characteristics of tumors recorded by the GPC registries at the time of this writing. New records continue to be added; of the sites that have implemented the tumor table, five update their table monthly, another five update the table quarterly, and one site updates their table semiannually.
Current Use of the Tumor Table
Cancer registry data has played an important role for research within other networks,10,11 and it already plays an important role in two PCORnet projects:
The Neuroendocrine Tumors – Patients Reported Outcomes (NET-PRO) study is a prospective cohort study, enrolling over 2,500 neuroendocrine tumor (NET) patients from 14 clinical centers in the PCORnet network, almost half of which are GPC sites. The main goal of the project is to use longitudinal data to examine treatment effectiveness and how it is affected by tumor/patient factors and treatment sequencing.12
As with any study, identifying patients for recruitment is a prime concern. To help sites with this, the NET-PRO lead site developed two computable phenotypes that operated on the clinical data stored at each site’s data warehouse. The “low touch” phenotype was designed to apply strict criteria to identify potential participants with high confidence of eligibility. Patients identified in this way would be recruited with low touch methods like e-mail. Another phenotype was designed to identify anyone who possibly had a NET. The eligibility of patients identified in this more lenient way would be confirmed with a brief chart review before attempting recruitment. The tumor table was used to determine the validity of these phenotypes by serving as a “gold standard” against which phenotype-identified patients were compared.
In addition to its use for validation, the registry data were used to identify patients for recruitment. Since registries need time to identify and fully describe cases, there is a lag before all cases are completely abstracted. For that reason, the registry records were useful for identifying earlier-diagnosed cases.
The Chemotherapy Dosing in Patients with Obesity study is using data from GPC sites in a retrospective cohort study to evaluate real-world chemotherapy treatment for patients with breast cancer. Balancing the benefit and harms of chemotherapy has always been challenging; only a small change in dose separates insufficient therapeutic effect from life-threatening toxicity. The appropriate balance with chemotherapy is even less clear for patients with obesity as there is limited trial evidence for this group.
This study will provide real-world estimates of chemotherapy dosing to patients with and without obesity and will assess the outcomes associated with those treatment decisions. Because the indication for chemotherapy depends on tumor stage, grade, and molecular subtype, the corresponding data from the tumor table has been required for identifying the study cohort.
In addition, the registry data will guide interpretation of data from the other CDM tables. In particular, the “class of case” field indicates how much of a patient’s treatment was provided at the reporting hospital. If this field indicates that the patient’s entire treatment was provided at the hospital, investigators can be more confident that the lack of a treatment record truly indicates that no treatment took place. Similarly, NAACCR fields that indicate extent and timing of treatments can inform interpretation of data from the CDM procedures table (e.g., classifying chemotherapy as adjuvant or neoadjuvant).
Discussion
By linking the tumor registry to the PCORnet CDM tables from eleven GPC institutions, we created a powerful resource that currently includes records for more than half a million tumors. This linkage allows each source of information to compensate for limitations in the other. For example, registry records have structured data that are absent in EHR/billing data found in the CDM. CDM tables, in contrast, can have more recent data since it can take up to a year for registries to finalize a tumor record. CDM tables may also contain longitudinal data that is not captured by registries. Together, this linked resource expands the possibilities for the use of real-world data for cancer research.
With this resource, investigators can develop patient cohorts based on cancer characteristics (using data elements from the tumor table) and track treatments (using data elements from the CDM tables). Additionally, the CDM tables allow identification of cancer patients with specified comorbidities and clinical measurements. For example, we demonstrated that almost 94% of the identified tumors that were diagnosed or treated at the reporting facility could be matched to body mass index values calculated from data in the CDM tables. Importantly, working within the PCORnet CDM framework enables such queries to be readily implemented across institutional boundaries.
Results indicated the importance of thoroughly evaluating the tumor table data for use in specific research studies. Most of the sites needed to address quality issues before data were research ready (e.g., incomplete data and multiple records for tumors). Differing refresh schedules may also lead to different data availability between sites. Similarly, we found that each site had a small number of tumors with multiple records. Investigators should screen for these and any other relevant issues to ensure they are properly considered in the analyses.
Implementing the tumor table required building new relationships at most sites since clinical data warehouses are often maintained by research informatics teams and cancer registries are usually operated by cancer centers. These teams often lack close working relationships.
These efforts are already benefitting cancer research. For the NET-PRO study, the tumor table was used to validate computable phenotypes for patient identification and to identify patients for recruitment. It will be an important part of the analytic data set. For the Chemotherapy Dosing study, the data on stage, grade, and molecular subtype has been crucial for identifying the study cohort, and it also provides context for understanding treatment records.
Strengths and Limitations
This report covers GPC experiences securing access to registry data through the implementation of the tumor table. This should provide valuable guidance for other institutions, although additional barriers may exist at non-GPC sites. Additional issues will likely emerge as data holdings are refreshed. Updates to the NAACCR standards and other changes (e.g., hospital mergers) may affect these processes.
The tumor table is currently based on Version 18 of the NAACCR data dictionary. The GPC will be developing processes for keeping the tumor table aligned with NAACCR changes. We also plan to update the quality control scripts to reflect lessons learned during the tumor table roll-out. These will be posted online as they become available.8
Conclusion
Integrating cancer registry data with the PCORnet CDM across institutions creates a powerful resource for cancer studies. It provides a wider array of structured, cancer-relevant concepts, allowing investigators to examine variability in those concepts across treatment environments. Queries can be executed efficiently and rapidly. Having the CDM tumor table in place greatly enhances the network’s effectiveness for real-world cancer research. Investigators can learn how to access these data at the GPC web page.13
Context Summary.
Key Objective:
We describe the experiences and lessons learned from integrating cancer registry data with electronic health record and billing data in an interoperable data model across the Greater Plains Collaborative, a PCORnet® clinical research network.
Knowledge Generated:
Eleven sites successfully overcame institutional barriers and addressed data quality issues to create a resource with records for 572,902 tumors. Variability in patient characteristics between sites was observed. This linkage is already being used for at least two multi-site cancer studies to define a study cohort, directly identify patients for recruitment, and understand treatment records.
Relevance (EIC Warner):
This study shows the successful multi-institutional harmonization to the PCORnet® standard, enabling scaling of multi-site cancer studies.
Acknowledgments
This work is supported through a Patient-Centered Outcomes Research Institute (PCORI) Program Award (#RI-MISSOURI-01-PS1; also, RD-2020C2–20329 for Michael O’Rorke, Bradley McDowell, & Elizabeth Chrischilles) and by the National Cancer Institute (R50CA243692). Other support for the project came from the National Institutes of Health (P30CA086862 for Bradley McDowell & Elizabeth Chrischilles, UM1TR004403 for Bradley McDowell, Elizabeth Chrischilles, & Boyd Knosp, UL1TR001436 for Bradley Taylor, UL1TR003167 for Alejandro Araya & Brian Shukwit).
Contributor Information
Bradley D. McDowell, University of Iowa, Holden Comprehensive Cancer Center, Iowa City, IA.
Michael A. O’Rorke, University of Iowa, Department of Epidemiology, Iowa City, IA.
Mary C. Schroeder, University of Iowa, Iowa City, IA.
Elizabeth A. Chrischilles, University of Iowa, Department of Epidemiology, Iowa City, IA.
Christine M. Spinka, University of Missouri, Columbia, MO.
Lemuel R. Waitman, University of Missouri, Westwood Hills, KS.
Kelechi Anuforo, Kansas University Medical Center, Kansas City, KS.
Alejandro Araya, University of Texas, Houston, Houston, TX.
Haddyjatou Bah, University of Utah, Salt Lake City, UT.
Jackson Barlocker, University of Utah Health, Salt Lake City, UT.
Sravani Chandaka, Kansas University Medical Center, Kansas City, KS.
Lindsay G. Cowell, O’Donnell School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX.
Carol R. Geary, University of Nebraska Medical Center, Omaha, NE.
Snehil Gupta, Washington University, St. Louis, MO.
Benjamin D. Horne, Intermountain Heart Institute, Intermountain Health, Salt Lake City, UT.
Boyd M. Knosp, University of Iowa, Iowa City, IA.
Albert M. Lai, Washington University, St. Louis, MO.
Vasanthi Mandhadi, University of Missouri, Columbia, MO.
Abu Saleh Mohammad Mosa, University of Missouri, Columbia, MO.
Phillip Reeder, University of Texas Southwestern Medical Center, Dallas, TX.
Giyung Ryu, University of Iowa, Institute for Clinical and Translational Science, Iowa City, IA.
Brian Shukwit, University of Texas, Houston, Houston, TX.
Claire Smith, Allina Health, Minneapolis, MN.
Alexander J. Stoddard, Medical College of Wisconsin, Milwaukee, WI.
Mahanazuddin Syed, University of Texas Science Center at San Antonio, Department of Population Health Sciences, San Antonio, TX.
Shorabuddin Syed, University of Texas Science Center at San Antonio, San Antonio, TX.
Bradley W. Taylor, Medical College of Wisconsin, Milwaukee, WI.
Jeffrey J. VanWormer, Marshfield Clinic Research Institute, Marshfield, WI.
References
- 1. Emamekhoo H, Carroll C, Stietz C, et al. Supporting Structured Data Capture for Patients With Cancer: An Initiative of the University of Wisconsin Carbone Cancer Center Survivorship Program to Improve Capture of Malignant Diagnosis and Cancer Staging Data. JCO Clin Cancer Inform 2022. (In eng).
- 2. Optimal Resources for Cancer Care: 2020 Standards. American College of Surgeons. (https://accreditation.facs.org/accreditationdocuments/CoC/Standards/Optimal_Resources_for_Cancer_Care_Feb_2023.pdf).
- 3. Charlton ME, Kahl AR, McDowell BD, et al. Cancer Registry Data Linkage of Electronic Health Record Data From ASCO’s CancerLinQ: Evaluation of Advantages, Limitations, and Lessons Learned. JCO Clin Cancer Inform 2022;6:e2100149. (In eng). DOI: 10.1200/cci.21.00149.
- 4. Waitman LR, Aaronson LS, Nadkarni PM, Connolly DW, Campbell JR. The Greater Plains Collaborative: a PCORnet Clinical Research Data Network. J Am Med Inform Assoc 2014;21(4):637–41. (In eng). DOI: 10.1136/amiajnl-2014-002756.
- 5. Forrest CB, McTigue KM, Hernandez AF, et al. PCORnet® 2020: current state, accomplishments, and future directions. J Clin Epidemiol 2021;129:60–67. (In eng). DOI: 10.1016/j.jclinepi.2020.09.036.
- 6. Qualls LG, Phillips TA, Hammill BG, et al. Evaluating Foundational Data Quality in the National Patient-Centered Clinical Research Network (PCORnet®). EGEMS (Wash DC) 2018;6(1):3. (In eng). DOI: 10.5334/egems.199.
- 7. PCORnet. Common Data Model Specification, Version 6.1. (https://pcornet.org/wp-content/uploads/2023/04/PCORnet-Common-Data-Model-v61-2023_04_031.pdf).
- 8. PCORnet. Tumor table specifications. Curators of the University of Missouri. (https://github.com/gpcnetwork/Tumor-Table).
- 9. North American Association of Central Cancer Registries. Data Standards and Data Dictionary. NAACCR. (https://apps.naaccr.org/data-dictionary/home).
- 10. Hornbrook MC, Hart G, Ellis JL, et al. Building a virtual cancer research organization. J Natl Cancer Inst Monogr 2005(35):12–25. (In eng). DOI: 10.1093/jncimonographs/lgi033.
- 11. SEER-Medicare Linked Data Resource. National Cancer Institute. (https://healthcaredelivery.cancer.gov/seermedicare/).
- 12. O’Rorke M, Chrischilles E. Making progress against rare cancers: A case study on neuroendocrine tumors. Cancer 2024. (In eng). DOI: 10.1002/cncr.35184.
- 13. Researcher Resources. Greater Plains Collaborative. (https://gpcnetwork.org/projects/researcher-resources/).
