Author manuscript; available in PMC: 2015 Dec 1.
Published in final edited form as: Arthritis Care Res (Hoboken). 2014 Dec;66(12):1790–1798. doi: 10.1002/acr.22377

Linkage of a De-identified United States Rheumatoid Arthritis Registry with Administrative Data to Facilitate Comparative Effectiveness Research

Jeffrey R Curtis 1, Lang Chen 1, Aseem Bharat 1, Elizabeth Delzell 1, Jeffrey D Greenberg 2, Leslie Harrold 3, Joel Kremer 4, Soko Setoguchi 5, Daniel H Solomon 6, Fenglong Xie 1, Huifeng Yun 1
PMCID: PMC4245366  NIHMSID: NIHMS634426  PMID: 24905637

Abstract

Background

Linkages between registries and administrative data may provide a valuable resource for comparative effectiveness research. However, personal identifiers that uniquely identify individuals are not always available. We describe methods to link a de-identified arthritis registry and U.S. Medicare data. The linked dataset was also used to evaluate the generalizability of the registry to the U.S. Medicare population.

Methods

Rheumatoid arthritis (RA) patients participating in the Consortium of Rheumatology Researchers of North America (CORRONA) registry were linked to Medicare data restricted to rheumatology claims or claims for RA. Deterministic linkage was performed using age, sex, provider identification number, and geographic location of the CORRONA site. We then searched for visit dates in Medicare matching visit dates in CORRONA, requiring at least 1 exact matching date. Linkage accuracy was quantified as a positive predictive value (PPV) in a sub-cohort (n=1581) with more precise identifiers.

Results

CORRONA participants with self-reported Medicare (n=11,001) were initially matched to 30,943 Medicare beneficiaries treated by CORRONA physicians. A total of 8,431 CORRONA participants matched on at least 1 visit; 5,317 matched uniquely on all visits. Both the number of patients who linked and linkage accuracy (from the subcohort) were high for patients with >2 visits (n=3458, 98% accuracy), exactly 2 visits (n=822, 96% accuracy), and 1 visit (n=1037, 79% accuracy) that matched exactly on calendar date. Demographics and comorbidity profiles of registry participants were similar to those of non-participants, except that participants were more likely to use DMARDs and biologics.

Conclusion

Linkage between a national, de-identified outpatient arthritis registry and Medicare data on multiple non-unique identifiers appears feasible and valid.

Keywords: rheumatoid arthritis, registry, cohort, administrative data, generalizability, linkage

Background

All data sources have strengths and weaknesses that must be considered when selecting one for comparative effectiveness research. Registry or cohort data typically excel in capturing clinical and phenotypic information. However, they are sometimes limited in the frequency and scope of data collection due to concerns about participant burden and cost. Large administrative claims databases have been shown to be valuable and have been widely used to study comparative effectiveness research questions, especially for safety outcomes (1). Among their strengths are large sample sizes that allow for examination of rare exposures and outcomes, and their comprehensive capture of healthcare services provided to an individual, as well as healthcare costs. However, they generally lack detailed clinical information, and thus concerns about residual confounding due to disease activity, severity, and other unmeasured factors need to be considered. A `hybrid' approach that links different types of data sources together would therefore likely be valuable to overcome the limitations inherent to any single type of data source (2–4). For example, an analysis conducted using an arthritis registry with data collected at physician office visits would benefit from linkage to an administrative database to provide a more complete understanding of patients' adherence to both their arthritis drugs and their non-arthritis-related medications. It also would allow for more complete long-term follow-up of safety events of interest (e.g. hospitalization for myocardial infarction, incident malignancy) for pharmacovigilance purposes, even if patients changed arthritis providers or no longer had any engagement with the registry.

A number of published examples exist where administrative data have been linked to a cohort or registry (5). In terms of the technical requirements for such linkages, having unique identifiers that are considered protected health information (PHI), such as social security numbers, date of birth (DOB), and sex, is generally sufficient to permit linkages with high validity (6–8). Several examples of linkages between administrative data and inpatient procedure or device registries where unique PHI is not available also have been published. Several of these examples have linked on hospital center and other non-unique information (e.g. dates of admission/discharge) (6, 9, 10). It has been shown that this deterministic linkage using multiple non-unique identifiers produces highly accurate linkage, with >95% sensitivity and >98% specificity compared to linkage using unique identifiers (11). However, in an outpatient registry, many participating individuals may not undergo hospitalization or a medical procedure, and these methods may not be sufficient to link ambulatory cohorts and registries to administrative claims data. Moreover, little guidance exists about the optimal methods for, or the success of, linking an outpatient cohort or registry to administrative claims data when a unique identifier is not available.

In light of this evidence gap in linkage methodology, we sought to link a large, de-identified, outpatient registry of patients with rheumatoid arthritis (RA) with national Medicare administrative claims data. The purpose of this report is to describe the methods and validity of this linkage. Further, administrative data were used to compare characteristics of registry participants to 1) non-participants with RA, treated by the same rheumatologists, to assess for a potential selection bias in those who were enrolled in the registry and 2) non-participants with RA treated by rheumatologists not participating in the registry, to assess the national generalizability of RA patients enrolled in the registry compared to the remainder of the U.S.

Methods

Data Sources

We used data from the Centers for Medicare & Medicaid Services (CMS) from 2006–2010 and selected all beneficiaries with at least one outpatient physician diagnosis of RA (International Classification of Diseases, Ninth Revision [ICD-9] 714.X). Only a single claim with an RA diagnosis code was required, intentionally, to maximize sensitivity. Medicare enrollees were classified as “observable” in the Medicare data if they had both Medicare part A (inpatient) and part B (outpatient) coverage and were not enrolled in Medicare Advantage. Claims for Medicare Advantage enrollees and for those with only part A or part B are typically incomplete. These data are considered “research identifiable” given the possibility that individuals could be uniquely identified, either in the data by itself or through linkage procedures of the type that we implemented. Governance of the data is under the aegis of a Data Use Agreement (DUA) from CMS that specifies what may be done with the data and who has access to it. The default DUA language generally prohibits linkage to any external data source without explicit approval from CMS, which was obtained for this project. The DUA also prohibits public disclosure of any information about groups of fewer than 11 people to avoid concerns about patient privacy.
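The “observable” classification described above amounts to a simple filter. The following is a minimal sketch, assuming hypothetical field names (`part_a`, `part_b`, `medicare_advantage`) that are not taken from the actual CMS file layout.

```python
def is_observable(beneficiary):
    # Observable = both Part A and Part B coverage, and not in Medicare Advantage.
    # Field names are hypothetical, chosen only for illustration.
    return (beneficiary["part_a"] and beneficiary["part_b"]
            and not beneficiary["medicare_advantage"])


# Made-up example records, not real enrollment data.
beneficiaries = [
    {"id": "B1", "part_a": True, "part_b": True, "medicare_advantage": False},
    {"id": "B2", "part_a": True, "part_b": False, "medicare_advantage": False},
    {"id": "B3", "part_a": True, "part_b": True, "medicare_advantage": True},
]
observable_ids = [b["id"] for b in beneficiaries if is_observable(b)]
print(observable_ids)
```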

We also used 2006–2010 data for RA patients participating in the Consortium of Rheumatology Researchers of North America (CORRONA) arthritis registry (12). CORRONA is a registry of patients with RA and psoriatic arthritis (PsA) as confirmed by rheumatologists; only participants with RA were selected. Patients explicitly consent to participate in the registry, although the possibility that the registry data might be linked to external data sources was not envisioned at the time the study protocol was created. For the purposes of this analysis, patients were considered eligible for linkage to Medicare after they first self-reported that they had Medicare coverage (of any type, including Medicare Advantage) at a CORRONA visit. Both that visit and all subsequent CORRONA visits were eligible for inclusion in the linkage procedure.

Linkage Procedures

Deterministic linkage, or rule-based record linkage, finds matches based on a set of rules about linkage variables, and was performed using multiple non-unique variables. The key variables included year of birth, sex, physician specialty (rheumatologist or not), state that the CORRONA provider practiced in, specific CORRONA provider identification number (UPIN/NPI) (13), and calendar dates of CORRONA provider office visits. The initial approach required exact matches between the calendar dates of all provider visits in the Medicare data and corresponding dates in the CORRONA data.

If a registry participant matched to more than one specific Medicare beneficiary, or vice-versa, priority was given to the matched pair with the greatest number of provider visits. For example, if a given registry participant matched 2 Medicare beneficiaries, one with 5 matched visits and another with only 1 matched visit, the pair with 5 matched visits would be considered correctly matched. If a registry participant matched to more than one Medicare beneficiary with the same number of visits, these `tied' pairs were considered unmatched. After selecting matches as described above, we implemented a procedure that allowed for `fuzzy matching' on visit dates with CORRONA providers and that accepted a mismatch of visit dates by up to ±4 days; matches obtained in this way were included in the primary analysis. Additional variations on mismatch in visit dates (e.g. allowing mismatch in a single digit of visit month or day) were explored but did not appear fruitful in preliminary analyses and were not considered further. Other factors that were considered as potential linkage variables and were available within both the registry and Medicare data (e.g. specific drug exposures, such as biologic medications) were either used too infrequently or imposed such restrictions on the sample size (e.g. because Medicare Part D prescription drug coverage was required) that they were deemed impractical.
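The linkage rules described above (blocking on non-unique variables, counting visit dates that agree, preferring the pair with the most matched visits, and discarding ties) can be illustrated with a small sketch. This is not the authors' SAS implementation: the record layout, field names, and example data are hypothetical, and the sketch ranks candidate pairs by the number of matched visits rather than enforcing every rule shown in Figure 1.

```python
from collections import defaultdict
from datetime import date


def block_key(rec):
    # Non-unique blocking variables: year of birth, sex, provider ID, state.
    return (rec["birth_year"], rec["sex"], rec["provider_id"], rec["state"])


def count_matched_visits(registry_dates, claim_dates, tolerance_days=0):
    # Count registry visits that have a claim within +/- tolerance_days.
    claim_days = [d.toordinal() for d in claim_dates]
    return sum(
        any(abs(v.toordinal() - c) <= tolerance_days for c in claim_days)
        for v in registry_dates
    )


def top_counts(candidates, position):
    # For each ID, store (best matched-visit count, number of pairs reaching it).
    top = {}
    for cand in candidates:
        n, key = cand[0], cand[position]
        best, ties = top.get(key, (0, 0))
        if n > best:
            top[key] = (n, 1)
        elif n == best:
            top[key] = (best, ties + 1)
    return top


def link(registry, medicare, tolerance_days=0):
    # Return {registry_id: medicare_id} for pairs that are the unique best
    # match from both the registry side and the Medicare side.
    blocks = defaultdict(list)
    for m in medicare:
        blocks[block_key(m)].append(m)
    candidates = []  # (matched visits, registry id, medicare id)
    for r in registry:
        for m in blocks.get(block_key(r), []):
            n = count_matched_visits(r["visits"], m["visits"], tolerance_days)
            if n >= 1:
                candidates.append((n, r["id"], m["id"]))
    top_r = top_counts(candidates, 1)
    top_m = top_counts(candidates, 2)
    return {r: m for n, r, m in candidates
            if top_r[r] == (n, 1) and top_m[m] == (n, 1)}


def rec(id_, birth_year, sex, provider_id, state, visits):
    return {"id": id_, "birth_year": birth_year, "sex": sex,
            "provider_id": provider_id, "state": state, "visits": visits}


# Hypothetical records for illustration only.
registry = [
    rec("P1", 1940, "F", "DOC1", "AL",
        [date(2008, 1, 10), date(2008, 5, 12), date(2008, 9, 15)]),
    rec("P2", 1938, "M", "DOC2", "NY", [date(2009, 3, 3)]),
    rec("P3", 1945, "F", "DOC3", "MA", [date(2009, 6, 1), date(2009, 10, 5)]),
]
medicare = [
    rec("B1", 1940, "F", "DOC1", "AL",
        [date(2008, 1, 10), date(2008, 5, 12), date(2008, 9, 15), date(2008, 12, 1)]),
    rec("B2", 1940, "F", "DOC1", "AL", [date(2008, 5, 12)]),
    rec("B3", 1938, "M", "DOC2", "NY", [date(2009, 3, 3)]),
    rec("B4", 1938, "M", "DOC2", "NY", [date(2009, 3, 3), date(2009, 7, 7)]),
    rec("B5", 1945, "F", "DOC3", "MA", [date(2009, 6, 3), date(2009, 10, 4)]),
]

exact_links = link(registry, medicare)                    # exact dates only
fuzzy_links = link(registry, medicare, tolerance_days=4)  # +/- 4 days allowed
print(exact_links, fuzzy_links)
```

In this toy example, P1 links to B1 (3 matched visits beat 1), P2 stays unmatched because B3 and B4 tie, and P3 links to B5 only when the ±4 day tolerance is applied.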

Linkage accuracy was evaluated in a subcohort of 1,581 CORRONA participants with more complete information that included full date of birth and/or social security number. These individuals explicitly consented to the linkage and made this additional PHI available for this purpose. Individuals who matched on both month and year of birth, or who matched on full social security number, were considered as accurately matched.

Statistical analysis

Descriptive statistics for the cohort were used to evaluate patients' characteristics according to the accuracy of linkage. Linkage accuracy was defined at the level of groups of similar patients who were classified by the linkage procedure into the same category, as depicted in a flow diagram figure. Information from the validation subcohort with exact date of birth (not just year) and/or SSN available was used as the gold standard for linkage accuracy in the terminal categories in the flow diagram, each of which was shown using both colors and shading. Accuracy was reported as a positive predictive value (PPV) based upon results from the subcohort, and 95% confidence intervals were calculated using a binomial distribution. Characteristics of CORRONA patients who were linked with high accuracy (i.e. they were linked by the matching algorithm with an accuracy >= 80%), lower accuracy (<80%), or not linked were examined descriptively using CORRONA registry data to evaluate whether there were differences between patients who were accurately linked and those who were not.
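As an illustration of the PPV calculation, the sketch below computes a proportion and an exact (Clopper-Pearson) binomial 95% confidence interval using only the Python standard library. The text states only that a binomial distribution was used, so the choice of the exact method is an assumption. The function names are hypothetical, and the example counts (956 confirmed of 972 linked) are those reported for the category matched on more than 2 visits.

```python
import math


def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p), with each term computed in log space.
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def solve_increasing(f, target):
    # Bisection on (0, 1) for a function f that increases with p.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ppv_with_exact_ci(correct, linked, alpha=0.05):
    # Returns (PPV, lower, upper) with a Clopper-Pearson interval.
    ppv = correct / linked
    if correct == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= correct | p) = alpha / 2.
        lower = solve_increasing(
            lambda p: 1 - binom_cdf(correct - 1, linked, p), alpha / 2)
    if correct == linked:
        upper = 1.0
    else:
        # Upper bound: P(X <= correct | p) = alpha / 2.
        upper = solve_increasing(
            lambda p: 1 - binom_cdf(correct, linked, p), 1 - alpha / 2)
    return ppv, lower, upper


ppv, lower, upper = ppv_with_exact_ci(956, 972)
print(f"PPV = {ppv:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```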

Evaluation of the generalizability of the registry used the most recent twelve months of Medicare data to compare characteristics of the CORRONA patients (linked with >= 80% accuracy) with those of patients of the same CORRONA rheumatologists who were not enrolled in the registry (and did not even have a single exactly matched visit) and with those of RA patients of rheumatologists who did not participate in CORRONA. This comparison was restricted to patients with 3 or more Medicare claims with RA diagnoses (714.0) from a rheumatologist office visit to improve the specificity of the RA diagnosis in the Medicare data. In this analysis, 98% of all linked CORRONA registry participants met this additional criterion. Standardized mean differences (SMDs) were computed to evaluate the magnitude of differences between groups. SMDs of <0.1 were considered clinically unimportant (14). Analyses were conducted in SAS 9.3. The linkage was approved by the CORRONA central institutional review board and by the UAB institutional review board (IRB) without the requirement to obtain additional patient consent.
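For readers unfamiliar with SMDs, the sketch below uses the commonly cited formulas for continuous and binary variables. The text does not state the exact formula used, so this is an assumption, and the function names are hypothetical. The inputs are taken from the Gender and Age rows of Table 1 (patients not linked versus patients linked with ≥ 80% accuracy).

```python
import math


def smd_continuous(mean1, sd1, mean2, sd2):
    # Absolute mean difference divided by the pooled standard deviation.
    return abs(mean1 - mean2) / math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)


def smd_binary(p1, p2):
    # Absolute difference in proportions, scaled by the pooled binomial SD.
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    return abs(p1 - p2) / math.sqrt(pooled_var)


# Female: 1097 of 1619 not linked vs. 3543 of 4714 linked with >= 80% accuracy.
smd_female = smd_binary(1097 / 1619, 3543 / 4714)

# Age: 67.47 (11.93) not linked vs. 73.17 (9.3) linked with >= 80% accuracy.
smd_age = smd_continuous(67.47, 11.93, 73.17, 9.3)

print(round(smd_female, 4), round(smd_age, 4))
```

With these inputs the binary formula gives approximately 0.164, in line with the 0.1644 reported in Table 1. The continuous formula gives approximately 0.533 for age, close to but not identical to the reported 0.5301, which may reflect rounding of the published means and standard deviations or a slightly different pooling method.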

Results

In the Medicare data, 64,229 patients had at least one RA diagnosis from one of the 387 rheumatologists registered with CORRONA. A total of 11,001 CORRONA participants with rheumatologist-confirmed RA had self-reported Medicare coverage of any type and were potentially linkable. Of these, 10,050 CORRONA participants were preliminarily linked to 1 or more Medicare beneficiaries from a pool of 30,943 Medicare beneficiaries based on matching on year of birth, sex, and CORRONA provider number, as shown in the flow diagram in Figure 1.

Figure 1. Flow Diagram Describing the Approach to Link Medicare Data to an Outpatient Arthritis Registry on Multiple, Non-Unique Factors

In order to improve the specificity of the linkage, Medicare claims were subsequently restricted to those from a rheumatologist or where the primary diagnosis was RA. Visits with providers were restricted to those occurring in the same state as the CORRONA provider. Following this restriction, there were 1,619 CORRONA participants who did not have at least one visit that matched exactly on calendar date in both the Medicare data and the CORRONA data, and these individuals were removed from further consideration.

The remaining 8,431 CORRONA participants were potentially linkable to a pool of 12,877 Medicare enrollees based upon having at least one provider visit match exactly on calendar date. Of these, 5,756 CORRONA participants linked to 5,979 Medicare enrollees on every visit that occurred in both CORRONA and Medicare data. The median (IQR) number of matched visits was 3 (1, 7). Not all matches were unique, as 502 patients tied with another CORRONA participant or Medicare beneficiary. After prioritizing those with the greatest number of matched visits over potential matches with fewer matched visits, 51 and 12 patients had at least 2 provider visits and were moved to the uniquely matched population. The remaining 439 tied individuals were considered unmatched.

After resolving ties, 5,317 patients were considered uniquely linked. Among these individuals, 3,458 matched exactly on more than 2 visits, shown in the red horizontal-shaded category in Figure 1. Of these, 972 were in the validation subcohort with additional identifying information. Based upon the results for patients classified into this category from the validation subcohort (n=972 individuals), linkage accuracy (i.e. PPV) for patients in this category was 956/972=98.4% (95% CI 97% – 99%). For those who matched on exactly 2 visits (n=822 in the entire cohort, 212 in the validation subcohort; blue vertically-shaded category in Figure 1), PPV was 96.0% (95% CI 93% – 98%). For those who matched on only 1 visit (n=1037, 212 in the validation subcohort; green diagonal-shaded category in Figure 1), linkage accuracy was 78.9% (95% CI 72.4% – 84.4%). If one required that patients match on month, day and year of birth rather than only on month and year, linkage accuracy was approximately 3–4% lower.

Among the 2,675 patients who matched on exact calendar dates for some but not all visits, an additional 524 matched pairs were matched on all visits after allowing mismatch by up to 4 calendar days. For those matching on more than 2 visits (n=434, 159 in the validation subcohort; orange hatched category in Figure 1), linkage accuracy was 88.7% (95% CI 82.7% – 93.2%). For those matching on exactly 2 visits (n=90, with 26 in the validation subcohort; yellow speckled category in Figure 1), linkage accuracy was 34.6% (95% CI 17.2% – 55.7%).

Characteristics of Medicare enrollees who linked with high accuracy (linkage accuracy >= 80%, n=3458+822+434), lower accuracy (< 80%, n=1037+90), or not at all (n=1619) are shown in Table 1. Linked patients were somewhat older and more likely to be female. They had a longer disease duration of RA and were more likely to have used biologics. In total, although differences were observed in RA-related and other clinical characteristics, the magnitude of these differences was generally small. Characteristics of the remaining patients not categorized as linked or definitively not linked are shown in the Appendix Table. In general, these patients' characteristics were intermediate between patients linked with high accuracy and those linked with lower accuracy.

Table 1.

Descriptive Characteristics from Registry Data of Medicare Enrollees with Rheumatoid Arthritis Who Were Linked to Registry Data with High Accuracy, Lower Accuracy, or Not at All*

| Characteristic | Patients who were not linked** | SMD vs. Linked ≥ 80% column | Patients Linked with < 80% Accuracy | SMD vs. Linked ≥ 80% column | Patients Linked with ≥ 80% Accuracy |
|---|---|---|---|---|---|
| N | 1,619 | | 1,127 | | 4,714 |
| Age, mean (std) | 67.47 (11.93) | 0.5301 | 69.47 (10.61) | 0.368 | 73.17 (9.3) |
| Gender, female | 1097 (67.8%) | 0.1644 | 820 (72.8%) | 0.0547 | 3543 (75.2%) |
| Race, White | 1313 (81.8%) | 0.1368 | 945 (83.9%) | 0.0793 | 4052 (86.7%) |
| Race, Hispanic | 109 (6.8%) | 0.1555 | 49 (4.4%) | 0.0503 | 158 (3.4%) |
| Race, Black | 134 (8.3%) | 0.0332 | 91 (8.1%) | 0.0237 | 348 (7.4%) |
| Race, Asian | 21 (1.3%) | 0.0303 | 18 (1.6%) | 0.0544 | 46 (1.0%) |
| Race, Other | 29 (1.8%) | 0.0277 | 23 (2.0%) | 0.0448 | 68 (1.5%) |
| | | 0.1672 | | 0.1593 | |
| Body Mass Index, mean (std) | 29.10 (7.05) | 0.5301 | 29.05 (7.1) | 0.368 | 27.93 (6.87) |
| RA-Related Characteristics | | | | | |
| Duration of RA in years +, mean (std) | 13.15 (10.37) | 0.3167 | 12.78 (11.3) | 0.335 | 16.63 (11.60) |
| CDAI, mean (std) | 11.93 (11.75) | 0.1629 | 12.66 (12.02) | 0.2259 | 10.1 (10.17) |
| HAQ-DI, mean (std) | 0.46 (0.55) | 0.0065 | 0.53 (0.56) | 0.1136 | 0.47 (0.55) |
| ESR, mean (std) | 27.41 (25.19) | 0.0546 | 27.87 (23.02) | 0.0376 | 28.75 (23.92) |
| Anti-CCP antibody, % | 360 (64.3%) | 0.0694 | 272 (65.7%) | 0.0398 | 992 (67.6%) |
| History of biologic use, % | 876 (54.1%) | 0.1527 | 579 (51.4%) | 0.2079 | 2905 (61.6%) |
| RF positive, % | 626 (71.7%) | 0.0556 | 421 (72.7%) | 0.0332 | 2094 (74.2%) |
| History of DMARD use, % | 1553 (95.9%) | 0.204 | 1059 (94.0%) | 0.2825 | 4671 (99.1%) |
| Patient Fatigue, mean (std) | 34.42 (30.17) | 0.0243 | 36.97 (29.11) | 0.0864 | 34.78 (28.82) |
| Patient Global Assessment, mean (std) | 33.64 (28.54) | 0.0426 | 35.48 (28.11) | 0.1137 | 31.80 (27.02) |
| Patient pain, mean (std) | 35.59 (29.41) | 0.0472 | 38.41 (29.32) | 0.153 | 33.19 (28.03) |
| Comorbidities | | | | | |
| Chronic Heart Failure, % | 32 (3.3%) | 0.0334 | 26 (3.4%) | 0.0265 | 88 (3.9%) |
| COPD, % | 61 (4.6%) | 0.131 | 59 (7.1%) | 0.232 | 94 (2.3%) |
| Diabetes, % | 185 (11.4%) | 0.0257 | 153 (13.6%) | 0.0909 | 500 (10.6%) |
| History of hospitalized infection while under observation, % | 86 (6.5%) | 0.0587 | 75 (9.0%) | 0.035 | 337 (8.1%) |
| MI, % | 99 (6.1%) | 0.0034 | 57 (5.1%) | 0.0426 | 284 (6.0%) |
| Stroke, % | 65 (4.0%) | 0.0653 | 54 (4.8%) | 0.0274 | 254 (5.4%) |
| Smoker, % | 216 (21.6%) | 0.1132 | 134 (17.2%) | 0.0002 | 396 (17.2%) |

SMD = standardized mean difference; CDAI = Clinical Disease Activity Index; HAQ-DI = Health Assessment Questionnaire Disability Index; ESR = Sedimentation Rate; CCP = Cyclic Citrullinated Peptide; COPD = Chronic Obstructive Pulmonary Disease

Note: All factors were measured at the most recent CORRONA visit

* High accuracy describes individuals in the colored and shaded categories of the Figure with a linkage PPV >80%. Lower accuracy describes participants in all other terminal categories of the Figure where a PPV is reported.

** Not considered matched by linkage procedure

The comparison of CORRONA RA patients to non-CORRONA RA patients of CORRONA physicians, and to RA patients of non-CORRONA rheumatologists, is shown in Table 2. CORRONA patients had similar demographics to the other two groups of RA patients, although they were more likely to be recruited from the northeast region of the U.S. They were similar in their prevalence of various comorbidities and in healthcare utilization. However, RA patients enrolled in CORRONA were more likely to receive DMARD and biologic therapies. Indeed, 85% of CORRONA-enrolled RA patients used a DMARD or biologic, as compared with 73% of RA patients treated by non-CORRONA rheumatologists.

Table 2.

Characteristics* of RA Patients of CORRONA doctors in CORRONA vs. not in CORRONA vs. Remainder of U.S. as Identified in Medicare Administrative Data

| Characteristic | RA Patients of non-CORRONA doctors | SMD vs. patients of CORRONA doctors, linked to CORRONA | RA Patients of CORRONA doctors, not linked to CORRONA | SMD vs. patients of CORRONA doctors, linked to CORRONA | RA Patients of CORRONA doctors, in CORRONA |
|---|---|---|---|---|---|
| Number of RA patients, n (%) | 446274 | | 23511 | | 5532 |
| Number of treating rheumatologists, n | 5,111 | | 310 | | 280 |
| Age, mean (std) | 72.62 (10.82) | 0.0219201 | 72.40 (12.27) | 0.0002204 | 72.39 (9.85) |
| Sex, female | 332784 (74.7%) | 0.0023114 | 16763 (71.4%) | 0.0716897 | 4123 (74.6%) |
| Race, White | 378675 (85.0%) | 0.1366321 | 20226 (86.2%) | 0.1033576 | 4949 (89.5%) |
| Race, Black | 42520 (9.5%) | 0.067486 | 2438 (10.4%) | 0.0955293 | 423 (7.7%) |
| Race, Hispanic | 11230 (2.5%) | 0.1695285 | 318 (1.4%) | 0.0931023 | 26 (0.5%) |
| Race, Native American | 2891 (0.6%) | 0.0187789 | 98 (0.4%) | 0.0131362 | 28 (0.5%) |
| Race, Asian | 4815 (1.1%) | 0.0398825 | 172 (0.7%) | 0.0032143 | 39 (0.7%) |
| Race, Other | 5375 (1.2%) | 0.007899 | 220 (0.9%) | 0.0182765 | 62 (1.1%) |
| Region, MidWest | 102673 (24.5%) | 0.0139384 | 3249 (14.5%) | 0.2693429 | 1350 (25.1%) |
| Region, North East | 69875 (16.7%) | 0.3212634 | 7030 (31.3%) | 0.025934 | 1619 (30.1%) |
| Region, South | 177853 (42.4%) | 0.2147475 | 10134 (45.1%) | 0.2695225 | 1727 (32.1%) |
| Region, West | 68825 (16.4%) | 0.1051288 | 2055 (9.1%) | 0.1145216 | 684 (12.7%) |
| Rural residence | 105156 (24.9%) | 0.0756817 | 5187 (23.1%) | 0.0324153 | 1169 (21.7%) |
| Comorbidities | | | | | |
| COPD | 64665 (14.5%) | 0.0268574 | 3377 (14.4%) | 0.0232578 | 750 (13.6%) |
| Diabetes | 94805 (21.2%) | 0.0825734 | 4919 (20.9%) | 0.0746882 | 994 (18.0%) |
| Cancer | 50530 (11.3%) | 0.0122488 | 2709 (11.5%) | 0.0059731 | 648 (11.7%) |
| MI | 6855 (1.5%) | 0.0151046 | 376 (1.6%) | 0.0201834 | 75 (1.4%) |
| Charlson Comorbidity Index, mean (std) | 0.94 (1.75) | 0.0720511 | 0.95 (1.74) | 0.0785728 | 0.82 (1.63) |
| Health Care Utilization | | | | | |
| Number of physician visits | 14.00 (10.63) | 0.0128949 | 14.52 (10.73) | 0.0627751 | 13.87 (10.12) |
| Any hospitalization, % | 130276 (29.2%) | 0.0666433 | 7050 (30.0%) | 0.0840562 | 1450 (26.2%) |
| Any joint surgery | 28216 (6.3%) | 0.0234481 | 1522 (6.5%) | 0.0172803 | 382 (6.9%) |
| Patients with full-year part D, n (%) | 246133 (55%) | | 12698 (54%) | | 2812 (51%) |
| Any non-biologic DMARD**, % | 156650 (63.6%) | 0.2079756 | 8119 (63.9%) | 0.2017908 | 2060 (73.3%) |
| Methotrexate | 114052 (46.3%) | 0.2351051 | 6083 (47.9%) | 0.2033205 | 1631 (58.0%) |
| Sulfasalazine | 15620 (6.3%) | 0.0124709 | 643 (5.1%) | 0.0428719 | 170 (6.0%) |
| Leflunomide | 23747 (9.6%) | 0.0631645 | 1089 (8.6%) | 0.1003173 | 326 (11.6%) |
| Hydroxychloroquine | 52122 (21.2%) | 0.0682692 | 2233 (17.6%) | 0.0226667 | 519 (18.5%) |
| Any Biologic**, % | 59920 (24.3%) | 0.2006718 | 3364 (26.5%) | 0.1510876 | 939 (33.4%) |
| Abatacept | 9971 (4.1%) | 0.174194 | 746 (5.9%) | 0.0915313 | 231 (8.2%) |
| Adalimumab | 11692 (4.8%) | 0.0483425 | 672 (5.3%) | 0.0235619 | 164 (5.8%) |
| Certolizumab | 1416 (0.6%) | 0.0642908 | 130 (1.0%) | 0.0143671 | 33 (1.2%) |
| Etanercept | 15071 (6.1%) | 0.0601676 | 808 (6.4%) | 0.0502699 | 215 (7.6%) |
| Golimumab | 1080 (0.4%) | 0.0759459 | 68 (0.5%) | 0.0629324 | 31 (1.1%) |
| Infliximab | 25151 (10.2%) | 0.2619993 | 1622 (12.8%) | 0.1824123 | 547 (19.5%) |
| Rituximab | 5141 (2.1%) | 0.0791542 | 307 (2.4%) | 0.0572912 | 95 (3.4%) |
| Tocilizumab | 219 (0.1%) | 0.0156767 | 33 (0.3%) | 0.0262631 | 4 (0.1%) |
| Any DMARD or Biologic**, % | 179716 (73.0%) | 0.2893637 | 9425 (74.2%) | 0.2617548 | 2382 (84.7%) |
| Prednisone**, % | 163678 (66.5%) | 0.0825578 | 8663 (68.2%) | 0.1193972 | 1759 (62.6%) |

SMD = Standardized mean difference; n/a = not applicable

* All factors described were measured during the most recent 1 year of Medicare data available.

** Assessed in Medicare part D pharmacy data in the subgroup of patients with 12 months of part D coverage

Discussion

Using a largely de-identified outpatient arthritis registry of participants with RA, we successfully linked almost 6,000 individuals to national Medicare administrative claims data from CMS. Linkage accuracy was shown to be 96–98% for patients who linked uniquely on 2 or more visits occurring both in the registry and the Medicare data. For patients who matched on exactly 1 visit, 79% of patients were accurately linked. For patients who linked on some but not all visits but had mismatch of only a few days between visits, those who had more than 2 visits in the registry and linked on all visits were matched with 89% accuracy. Based upon the overall linkage accuracy that we found for large numbers of patients in the registry, these results support the validity of using such methods to harness the power of a disease registry and administrative data together, even if PHI is not readily available.

The potential uses of linkages between disparate data sources are myriad. Administrative data can complement a registry by broadening the spectrum of exposures (e.g. non-rheumatology-related medication exposures) and outcomes (e.g. complete healthcare utilization, costs) that can be analyzed but which the registry cannot feasibly collect due to participant burden. They can also enhance case ascertainment for events of interest (7) by capturing events that are underreported while patients are under follow-up by the registry. They would also allow continued observation of patients who are otherwise lost to follow-up from the perspective of the registry and support pharmacovigilance efforts and large pragmatic trials. Conversely, the linked registry data may be used to complement administrative data analyses by providing information about potentially confounding variables (e.g. seropositivity, disease activity, extent of disability, disease duration, tobacco use) that would otherwise be missing or poorly reported. Registry data can also overcome limitations due to `left-censoring' of administrative data, whereby events occurring prior to the start of observation in administrative data are unknown, since a registry can capture a lifetime history of key health events and exposures of interest.

Prior literature has shown success similar to ours in linking registries to administrative data and examining outcomes, although most examples linked to an inpatient hospital or procedure-based registry deterministically, based upon factors including date of birth, sex, and hospital admission/discharge dates or procedure dates (15–17). Examples of additional data elements that may be useful for linking to external data include social security numbers, name, and full residential address. Linkage rates may also be improved if selected additional elements are available, such as specific insurance plan name (rather than simply specifying `commercial insurance'), insurance plan number, and similar details for prescription pharmacy plan coverage captured separately from patients' medical insurance information.

With linkage to a nationally-representative data source like Medicare, linkages can be used to compare the generalizability of participants in the registry to non-participants, as we did in Table 2. In general, the demographics and comorbidity profiles were quite comparable for patients enrolled in CORRONA versus those not enrolled. However, CORRONA patients were more likely to be treated with DMARDs and biologics compared to RA patients of non-CORRONA physicians. Some misclassification may exist in classifying a patient as having RA based on only a single RA diagnosis code from a rheumatologist (18, 19), although one study found that both the sensitivity and the positive predictive value of this approach were 90% (19). However, our analysis that required 3 RA diagnosis codes from a rheumatologist in one year in part addressed this issue and showed that 73% of RA patients of non-CORRONA rheumatologists were treated with DMARDs or biologics. While even this proportion receiving treatment may seem low, this estimate is comparable to, and somewhat higher than, national results derived from Medicare managed care plans, where only 63% of RA patients received these therapies (20). In that study, large variability in treatment was observed based on patient and health-plan related characteristics. It was also similar to results from RA patients represented in the Medicare Current Beneficiary Survey (21). The higher prevalence of biologic and DMARD use in linked CORRONA registry participants (85%) may reflect design features of the registry itself. CORRONA participants typically have visits at approximately 4 month intervals unless they are starting or changing therapies for RA, in which case they can be enrolled with an earlier visit interval.
The prevalence of DMARD and biologic use among RA patients treated by CORRONA rheumatologists but not enrolled in the registry (74%) was remarkably similar to the corresponding prevalence among RA patients of non-CORRONA rheumatologists (73%). This finding suggests that patients with more aggressively treated disease may be somewhat more likely to be enrolled than patients with more quiescent disease who are not making any changes to their RA medication regimen.

Our study is novel in that it links together a largely de-identified outpatient arthritis registry with 100% of CMS data for RA patients over a four-year period. Importantly, we were able to validate our linkage procedures in a large subcohort of more than 1500 participants who explicitly provided consent and made available additional information (e.g. full date of birth, social security number) to confirm the linkage. Despite these strengths, some limitations to this work need to be considered. First, the arthritis registry did not ask participants to self-report traditional fee-for-service Medicare as distinct from Medicare Advantage. Thus, some participants who said they had Medicare coverage were not linkable, inasmuch as only traditional fee-for-service enrollees have outpatient claims-level information available within CMS data. Additionally, we did not evaluate accuracy using unique identifiers even for patients in our subcohort but relied on matching date of birth; however, a previous validation study yielded >95% sensitivity and >98% specificity for correct matches using date of birth via a similar approach (11). We also recognize that deterministic linkage on multiple factors yielded high accuracy for individuals with 2 or more visits (red, horizontal shaded cell; and blue, vertical shaded cell), but accuracy was lower for individuals who did not match uniquely on all visits, or who had only a single matched registry visit. High specificity of linkage will be more important than high sensitivity for most applications using linked data, since it is better to leave an individual unlinked (yielding `false negatives') than to create incorrect linkages (`false positives'). Requiring linkage accuracy of 90–95% or greater may be a reasonable requirement, and this level of accuracy was achievable for large numbers of individuals in this cohort.

Other methods for linkage exist, such as estimating a person-level probability of being accurately linked, rather than using a rules-based approach where accuracy is determined at a group level for all individuals meeting those rules (22, 23). Using these methods in conjunction with additional data elements might be useful in some applications, but this was not feasible in this study given that few additional data elements were available within both the registry and administrative data with high enough prevalence. Additionally, not all patients ended up in classifiable groups with sufficient numbers from our validation subcohort to characterize these individuals as validly matched or not. For example, a participant with 5 visits in the registry who still had at least one mismatch in visit dates after the fuzzy matching procedure was not included in any classifiable group, since we could not estimate the accuracy for such individuals without additional information. Attempting to represent these individuals as either validly matched or not was therefore not feasible. We expect some of these patients would be linkable if additional information were available. The Appendix Table showed their characteristics were similar to linked patients.

We found that linked patients were older and included proportionally more women than unlinked patients, which may reflect left-censoring of the Medicare data. This may be a consequence of younger patients being more likely to choose insurance other than Medicare fee-for-service coverage. It may also reflect that patients can take several years to accrue enough matchable visits in both data sources to allow for linkage using these methods. While the age and sex differences between matched and unmatched patients should not create bias, they could affect the external generalizability of the matched cohort. Finally, although we showed that patients linked by the algorithm were generally linked accurately, the sensitivity of the approach could not be assessed, inasmuch as it was unclear how many patients should have been linkable. There were 10,050 CORRONA patients with RA who reported any type of Medicare coverage, and 5,317 (53%) were uniquely linked. However, CORRONA does not distinguish between traditional Medicare fee-for-service coverage and Medicare Advantage, and Medicare Advantage claims are not available in these CMS data. In 2011, 75% of all U.S. Medicare beneficiaries were enrolled in Medicare fee-for-service coverage (24). Applying that proportion to the 10,050 CORRONA patients reporting any Medicare coverage, the sensitivity of the method might tentatively be estimated as (0.53 / 0.75) = 0.71. A more refined approach to calculating sensitivity could be possible if patients are able to accurately self-report whether they have Medicare fee-for-service or Medicare Advantage coverage.
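The tentative sensitivity estimate above can be reproduced from the counts given in the text. This is a minimal sketch: the function name `estimated_sensitivity` is hypothetical, and it inherits the text's assumption that the national fee-for-service share in 2011 applies to registry participants.

```python
# Figures quoted in the text.
reported_medicare = 10_050   # CORRONA RA patients reporting any Medicare coverage
uniquely_linked = 5_317      # patients uniquely linked to Medicare data
ffs_share = 0.75             # U.S. Medicare beneficiaries in fee-for-service, 2011 (24)


def estimated_sensitivity(linked, reported, ffs_fraction):
    """Linked patients divided by the number expected to be linkable.

    Assumption: only fee-for-service enrollees are linkable, and the
    registry's fee-for-service share equals the national share.
    """
    expected_linkable = reported * ffs_fraction
    return linked / expected_linkable


linked_fraction = uniquely_linked / reported_medicare
sensitivity = estimated_sensitivity(uniquely_linked, reported_medicare, ffs_share)

print(f"Fraction uniquely linked: {linked_fraction:.2f}")
print(f"Tentative sensitivity:    {sensitivity:.2f}")
```

Using the unrounded counts rather than the rounded 0.53 gives the same value to two decimal places. The estimate would shift if registry participants enrolled in Medicare Advantage at a different rate than the national population.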

In conclusion, linkage between a largely de-identified national U.S. outpatient arthritis registry and U.S. Medicare claims data using multiple, non-unique identifiers appears both feasible and valid. An approach that matches partially de-identified data based on healthcare utilization within physician practices is likely to be useful in future studies even when individual PHI is unavailable.

Supplementary Material

01

Significance and Innovation.

  • Methods to link de-identified outpatient registry data with health plan data (e.g. administrative claims, electronic health records) have not been well studied.

  • Using deterministic linkage on multiple, non-unique identifiers (or variables), the CORRONA rheumatoid arthritis patient registry was linked with Medicare administrative claims data. High accuracy of the linkage was demonstrated in a subcohort of patients that had more detailed (or precise) information on linkage variables available.

  • The representativeness of the registry population was evaluated by comparing characteristics of registry participants to Medicare RA patients treated by non-CORRONA physicians. Registry participants were broadly representative of general Medicare RA patients, except that they were somewhat more likely to use biologics and DMARDs.

Acknowledgments

Funding: This work was supported by the Agency for Healthcare Research and Quality (AHRQ) (R01HS018517). Dr. Curtis receives support from the NIH (AR 053351).

Footnotes

Disclosures: JRC received consulting fees and research grants from Amgen, Abbott, BMS, Pfizer, Eli Lilly, Janssen, UCB, Roche/Genentech and CORRONA.

References

  • 1. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323–37. doi: 10.1016/j.jclinepi.2004.10.012.
  • 2. Weiss NS. The new world of data linkages in clinical epidemiology: are we being brave or foolhardy? Epidemiology. 2011;22(3):292–4. doi: 10.1097/EDE.0b013e318210aca5.
  • 3. Sturmer T, Jonsson Funk M, Poole C, Brookhart MA. Nonexperimental comparative effectiveness research using linked healthcare databases. Epidemiology. 2011;22(3):298–301. doi: 10.1097/EDE.0b013e318212640c.
  • 4. Potosky AL, Riley GF, Lubitz JD, Mentnech RM, Kessler LG. Potential for cancer related health services research using a linked Medicare-tumor registry database. Med Care. 1993;31(8):732–48.
  • 5. Virnig B, Durham SB, Folsom AR, Cerhan J. Linking the Iowa Women's Health Study cohort to Medicare data: linkage results and application to hip fracture. Am J Epidemiol. 172(3):327–33. doi: 10.1093/aje/kwq111.
  • 6. Jacobs JP, Edwards FH, Shahian DM, Haan CK, Puskas JD, Morales DL, et al. Successful linking of the Society of Thoracic Surgeons adult cardiac surgery database to Centers for Medicare and Medicaid Services Medicare data. Ann Thorac Surg. 2010;90(4):1150–6. doi: 10.1016/j.athoracsur.2010.05.042. discussion 6–7.
  • 7. Schousboe JT, Paudel ML, Taylor BC, Virnig BA, Cauley JA, Curtis JR, et al. Magnitude and consequences of misclassification of incident hip fractures in large cohort studies: the Study of Osteoporotic Fractures and Medicare claims data. Osteoporos Int. 2012. doi: 10.1007/s00198-012-2210-8.
  • 8. Bradley CJ, Given CW, Luo Z, Roberts C, Copeland G, Virnig BA. Medicaid, Medicare, and the Michigan Tumor Registry: a linkage strategy. Med Decis Making. 2007;27(4):352–63. doi: 10.1177/0272989X07302129.
  • 9. Hammill BG, Hernandez AF, Peterson ED, Fonarow GC, Schulman KA, Curtis LH. Linking inpatient clinical registry data to Medicare claims data using indirect identifiers. Am Heart J. 2009;157(6):995–1000. doi: 10.1016/j.ahj.2009.04.002.
  • 10. Curtis LH, Greiner MA, Hammill BG, DiMartino LD, Shea AM, Hernandez AF, et al. Representativeness of a national heart failure quality-of-care registry: comparison of OPTIMIZE-HF and non-OPTIMIZE-HF Medicare patients. Circ Cardiovasc Qual Outcomes. 2009;2(4):377–84. doi: 10.1161/CIRCOUTCOMES.108.822692.
  • 11. Setoguchi S, Zhu Y, Jalbert JJ, Williams L, Chen CY. Validity of Deterministic Record Linkage Using Multiple Indirect Personal Identifiers: Linking a Large Registry to Claims Data. Circ CCQO. 2014. doi: 10.1161/CIRCOUTCOMES.113.000294. In Press.
  • 12. Kremer JM. The CORRONA database. Clin Exp Rheumatol. 2005;23(5 Suppl 39):S172–7.
  • 13. Adhikesavan LG, Newman ED, Diehl MP, Wood GC, Bili A. American College of Rheumatology quality indicators for rheumatoid arthritis: benchmarking, variability, and opportunities to improve quality of care using the electronic health record. Arthritis Rheum. 2008;59(12):1705–12. doi: 10.1002/art.24054.
  • 14. Roland M, Torgerson DJ. What are pragmatic trials? BMJ. 1998;316(7127):285. doi: 10.1136/bmj.316.7127.285.
  • 15. Hess PL, Greiner MA, Fonarow GC, Klaskala W, Mills RM, Setoguchi S, et al. Outcomes associated with warfarin use in older patients with heart failure and atrial fibrillation and a cardiovascular implantable electronic device: findings from the ADHERE registry linked to Medicare claims. Clin Cardiol. 2012;35(11):649–57. doi: 10.1002/clc.22064.
  • 16. Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernandez-Diaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiol Drug Saf. 2013;22(1):16–24. doi: 10.1002/pds.3284.
  • 17. Li Q, Glynn RJ, Dreyer NA, Liu J, Mogun H, Setoguchi S. Validity of claims-based definitions of left ventricular systolic dysfunction in Medicare patients. Pharmacoepidemiol Drug Saf. 2011;20(7):700–8. doi: 10.1002/pds.2146.
  • 18. Kim SY, Servi A, Polinski JM, Mogun H, Weinblatt ME, Katz JN, et al. Validation of rheumatoid arthritis diagnoses in health care utilization data. Arthritis Res Ther. 2011;13(1):R32. doi: 10.1186/ar3260.
  • 19. Katz JN, Barrett J, Liang MH, Bacon AM, Kaplan H, Kieval RI, et al. Sensitivity and positive predictive value of Medicare Part B physician claims for rheumatologic diagnoses and procedures. Arthritis Rheum. 1997;40(9):1594–600. doi: 10.1002/art.1780400908.
  • 20. Schmajuk G, Trivedi AN, Solomon DH, Yelin E, Trupin L, Chakravarty EF, et al. Receipt of disease-modifying antirheumatic drugs among patients with rheumatoid arthritis in medicare managed care plans. JAMA. 2011;305(5):480–6. doi: 10.1001/jama.2011.67.
  • 21. Harrold LR, Peterson D, Beard AJ, Gurwitz JH, Briesacher BA. Time trends in medication use and expenditures in older patients with rheumatoid arthritis. Am J Med. 2012;125(9):937, e9–15. doi: 10.1016/j.amjmed.2011.11.014.
  • 22. Li X, Shen C. Linkage of patient records from disparate sources. Stat Methods Med Res. 2011. doi: 10.1177/0962280211403600.
  • 23. Tromp M, Ravelli AC, Bonsel GJ, Hasman A, Reitsma JB. Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J Clin Epidemiol. 2011;64(5):565–72. doi: 10.1016/j.jclinepi.2010.05.008.
  • 24. Medicare Advantage 2012 Data Spotlight: Enrollment Market Update. Available from http://kff.org/health-costs/report/medicare-advantage-2012-enrollment-market-update/
