Open Forum Infectious Diseases. 2025 Jan 23;12(1):ofaf021. doi: 10.1093/ofid/ofaf021

Coronavirus Disease 2019 (COVID-19) Real World Data Infrastructure: A Big-Data Resource for Study of the Impact of COVID-19 in Patient Populations With Immunocompromising Conditions

James M Crawford 1, Lynne Penberthy 2, Ligia A Pinto 3, Keri N Althoff 4, Magdalene M Assimon 5, Oren Cohen 6, Laura Gillim 7, Tracy L Hammonds 8, Shilpa Kapur 9, Harvey W Kaufman 10, David Kwasny 11, Jean W Liew 12, William A Meyer III 13, Shannon L Reynolds 14, Cheryl B Schleicher 15, Suki Subbiah 16, Catherine Theruviparampil 17, Zachary S Wallace 18, Jeremy L Warner 19,20, Suhyeon Yoon 21, Yonah C Ziemba 22,4
PMCID: PMC11756308  PMID: 39850579

Abstract

Background

We developed a United States–based real-world data resource to better understand the continued impact of the coronavirus disease 2019 (COVID-19) pandemic on immunocompromised patients, who are typically underrepresented in prospective studies and clinical trials.

Methods

The COVID-19 Real World Data infrastructure (CRWDi) was created by linking and harmonizing de-identified HealthVerity medical and pharmacy claims data from 1 December 2018 to 31 December 2023, with severe acute respiratory syndrome coronavirus 2 virologic and serologic laboratory data from major commercial laboratories and Northwell Health; COVID-19 vaccination data; and, for patients with cancer, 2010 to 2021 National Cancer Institute Surveillance, Epidemiology, and End Results registry data.

Results

The CRWDi contains 4 cohorts: patients with cancer; patients with rheumatic diseases receiving pharmacotherapy; noncancer solid organ and hematopoietic stem cell transplant recipients; and people from the general population including adults and pediatric patients. The project successfully linked and harmonized longitudinal, de-identified data on 5.2 million unique patients using privacy-preserving record linkage techniques. The system was developed in early 2024 and rapidly deployed, enabling longitudinal analysis of patient healthcare over the full geography of delivery settings and exploration of novel questions for populations at high risk for adverse outcomes.

Conclusions

The successful development of the CRWDi enables researchers to address unanswered questions that have arisen during the COVID-19 pandemic. By making the data broadly and freely available to academic researchers, this real-world data system represents an important complement to existing consortia and clinical trials that have emerged during the healthcare crisis and is readily reproducible for future purposing.

Keywords: immunocompromised, SARS-CoV-2, SEER, serology, vaccination


The COVID-19 Real World Data infrastructure dataset contains 5.2 million de-identified patient records, with focus on immunocompromising conditions, and is freely available to approved researchers to study the impact of COVID-19 on patient morbidities and outcomes.


Over the course of the coronavirus disease 2019 (COVID-19) pandemic, large-scale real-world datasets have enabled study of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and COVID-19 illness outcomes within general and high-risk populations [1–3]. A critical population of interest is immunocompromised patients, who either have cancer or diseases of autoimmunity or are receiving treatments that impair the immune system. These individuals have an increased risk for contracting infections, suffering poorer outcomes and generating and spreading new SARS-CoV-2 variants, constituting a public health risk [4]. Immunocompromised patients were generally excluded from participation in the initial COVID-19 vaccine immunogenicity and efficacy trials [5, 6]. Existing real-world datasets have left many questions unanswered pertaining to assessment and treatment of immunocompromised patients [7, 8]. Evidence-based guidelines remain high-level and do not provide detailed recommendations relevant to this high-risk population [9]. Studies of the quantitative immune response to SARS-CoV-2 in immunocompromised patients have been limited to the smaller size of prospective cohort studies [10–12], with larger real-world data studies relying primarily on qualitative serology test data only [7, 13–15]. Prior studies using administrative claims data and semi-quantitative SARS-CoV-2 laboratory test data have shown differing risks of serious outcomes after COVID-19 illness in immunocompromised patients with low-positive versus high-positive humoral immune responses [16, 17]. There is a continuing need to better understand the course of COVID-19 illness, its consequences, and vaccine responsiveness in these high-risk patient populations.

A combined, linked, de-identified infrastructure was developed by leveraging real-world data across multiple data sources and organizations from the United States (US), which contributed important clinical data and/or risk markers related to COVID-19 outcomes. The initial focus of the infrastructure was on high-risk immunocompromised patients, to support research on such questions as: the variability in immune response to vaccination and SARS-CoV-2 infection [18, 19]; the efficacy of vaccination against development of symptomatic and potentially severe disease [20–22]; the efficacy of vaccination in preventing postacute sequelae of COVID-19 illness [23–25]; the potential impact of SARS-CoV-2 infection on the occurrence of new cancers and other comorbid conditions [26]; and the impact of COVID-19 on mortality in patients with cancer [27–29]. An extended set of addressable research questions is given in Supplementary Table 1. Moreover, the establishment of CRWDi serves as proof-of-principle for making available a harmonized national data resource that can inform evolving public health emergencies such as a pandemic, particularly for patients who carry risk profiles that differ from the general population.

MATERIALS AND METHODS

CRWDi Development and Oversight

HealthVerity (Philadelphia, Pennsylvania) established CRWDi under the guidance of a subject matter expert working group having representation from government, academia, and industry (see Acknowledgements for membership). HealthVerity and Aetion, Inc (New York, New York) also were members of the working group, in their capacities as data provider and analytic experts, respectively. Project planning by the working group was initiated in February 2023 and completed by November 2023. The HealthVerity data extraction occurred in December 2023, with a refresher update in April 2024 to obtain reconciled claims records through 31 December 2023. CRWDi was made publicly available in May 2024 (https://seer.cancer.gov/data-software/crwdi).

Data Sources

A summarized list of the key data categories and their sources is shown in Table 1. A detailed description of the quality control used for data acquisition is provided in the Supplementary Materials. For linkage of medical claims, pharmacy claims, vaccination status data, and laboratory data, HealthVerity utilizes a highly reliable privacy-preserving record linkage (PPRL) solution to allow for data sharing while maintaining privacy [30]. In a 2023 report from the National Institutes of Health, the HealthVerity PPRL solution was ranked as leveraging one of the most accurate PPRL technologies [31]. De-identified tokens are generated for each patient at each data source [32]; token matching then takes place exclusively at HealthVerity against the much larger encrypted and tokenized dataset maintained by HealthVerity. While patient identities are not released from the data owner, HealthVerity utilizes an independent third-party reviewer to ensure that all data are compliant with privacy requirements and have minimal risk of reidentifiability.
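The token-then-match flow described above can be illustrated with a minimal sketch. This is an assumption for illustration only: the article does not describe HealthVerity's actual PPRL algorithm, identifier fields, or key management, so the keyed-hash (HMAC-SHA256) scheme, the choice of fields, and the names `normalize`, `make_token`, and `link_on_tokens` are all hypothetical.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, punctuation, and case so formatting differences do not break a match."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def make_token(first: str, last: str, dob: str, sex: str, secret_key: bytes) -> str:
    """Keyed hash of normalized identifiers (hypothetical field set).

    Only the token leaves the data source; the raw identifiers do not.
    """
    message = "|".join(normalize(part) for part in (first, last, dob, sex))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def link_on_tokens(source_a: dict, source_b: dict) -> dict:
    """Join two {token: record} mappings on exact token equality."""
    return {tok: (source_a[tok], source_b[tok]) for tok in source_a.keys() & source_b.keys()}


# Hypothetical example: the same person recorded with different formatting at two sources.
KEY = b"hypothetical-shared-secret"
claims = {make_token("Ana", "Díaz", "1970-01-02", "F", KEY): {"source": "claims"}}
labs = {make_token("ANA ", "Diaz", "1970/01/02", "f", KEY): {"source": "laboratory"}}
linked = link_on_tokens(claims, labs)
```

In this sketch the two records link because normalization removes the accent, spacing, case, and date-separator differences before hashing. Production PPRL systems typically add multiple token types and tolerance for typographical errors, which this sketch does not attempt.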

Table 1.

Data Sources and Data Categories Contributing to COVID-19 Real World Data Infrastructure (CRWDi)

| Data Source | Data Category | Data Content |
| HealthVerity: Medical claims (a) | Medical diagnosis (ICD-10-CM) | COVID-19, COVID-19–related conditions, long COVID |
| | | Cancer (including cancer diagnoses through 31 Dec 2023) |
| | | Rheumatic diseases (including systemic autoimmune conditions) |
| | | Transplantation (solid organ and hematopoietic stem cell) |
| | | Comorbid conditions |
| | Health resources utilization (ICD-10-PCS, CPT, HCPCS) | Hospitalization (including admission to intensive care unit) |
| | | Hospital length of stay |
| | | Ambulatory care |
| | | Diagnostic services |
| | | Treatment for cancer (infusion-based chemotherapy, oral antineoplastic therapy) |
| | | Treatment for rheumatic diseases (infusion-based therapy, oral therapy, injections) |
| | | Treatment for COVID-19 |
| | | Treatment for comorbid and/or other intercurrent acute or chronic conditions |
| HealthVerity: Pharmacy claims (a) | NDC codes | Treatment for cancer (oral antineoplastic therapy) |
| | | Treatment for COVID-19 (oral antiviral therapy) |
| | | Treatment for comorbid and/or other intercurrent acute or chronic conditions |
| HealthVerity: Vaccination status | Pharmacy NDC codes; public health sources | COVID-19 vaccination (vaccine type) |
| Laboratory data (b) | SARS-CoV-2 testing results | NAAT (detected, not detected) |
| | | Anti-S IgG (detected, not detected; semi-quantitative test results) |
| | | Anti-N IgG (detected, not detected; semi-quantitative test results; Northwell only) |
| SEER registry (c) | Characterization of the tumor | Tumor site, histology, grade |
| | Treatment | Surgery, chemotherapy, radiation therapy, hormonal therapy; neoadjuvant, adjuvant, subsequent treatments (eg, maintenance, recurrence) |
| | Survival | Survival time (months since date of diagnosis) |
| | | Cause of death (cancer, noncancer) |

Abbreviations: COVID-19, coronavirus disease 2019; CPT, Current Procedural Terminology; HCPCS, Healthcare Common Procedure Coding System; ICD-10-CM, International Classification of Diseases, Tenth Revision, Clinical Modification; ICD-10-PCS, International Classification of Diseases, Tenth Revision, Procedure Coding System; IgG, immunoglobulin G; NAAT, nucleic acid amplification test; NDC, National Drug Code; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; SEER, Surveillance, Epidemiology, and End Results.

(a) HealthVerity medical and pharmacy claims data are from 1 December 2018 to 31 December 2023, and include date of service or date of prescription; pharmacy claims also have accompanying ICD-10-CM codes.

(b) SARS-CoV-2 testing data are from Laboratory Corporation of America Holdings (Burlington, North Carolina), Quest Diagnostics Inc (Secaucus, New Jersey), and Northwell Health (New Hyde Park, New York), and include date of service.

(c) SEER registry data are for calendar years 2010 to 2021.

SARS-CoV-2 laboratory testing data were obtained from 2 major commercial laboratories (Laboratory Corporation of America [Labcorp], Burlington, North Carolina; and Quest Diagnostics, Secaucus, New Jersey) and from Northwell Health (hereafter “Northwell,” New Hyde Park, New York). For the commercial laboratories, testing data were based both on provision of clinically indicated testing and on surveillance testing performed on remnant blood specimens under contractual arrangements with the Centers for Disease Control and Prevention (CDC) [33]. Northwell laboratory data were based on provision of clinically indicated testing. For all sources, SARS-CoV-2 nucleic acid amplification test (NAAT) data were qualitative only (“detected” or “not detected”) and did not include cycle threshold values from the polymerase chain reaction assays on relevant platforms. Serologic SARS-CoV-2 anti-S test data were both qualitative (“detected” or “not detected”) and semi-quantitative. Northwell also contributed SARS-CoV-2 anti-N immunoglobulin G test data, both qualitative and semi-quantitative. Laboratory test data from all 3 sources had accompanying Logical Observation Identifier Names and Codes and, for the anti-S and anti-N semi-quantitative testing data, units, reference ranges, thresholds for detection, and numerical results, as per the assay platforms used by each laboratory.
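Because semi-quantitative results arrive with platform-specific units and thresholds, an analyst will usually want a single qualitative call per result before pooling across laboratories. The sketch below shows one way to do that. The record layout (`SerologyResult`), the placeholder LOINC string, the numeric values, and the rule that a value at or above the threshold counts as "detected" are assumptions for illustration, not the CRWDi schema or any laboratory's documented interpretation rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SerologyResult:
    loinc: str                          # LOINC code reported with the result
    value: Optional[float]              # semi-quantitative numeric result, if reported
    unit: str                           # unit as reported for the assay platform
    threshold: Optional[float]          # platform-specific threshold for detection
    qualitative: Optional[str] = None   # "detected" / "not detected", if reported


def qualitative_call(result: SerologyResult) -> str:
    """Return a harmonized qualitative call for one serology result."""
    # Prefer the laboratory's own qualitative result when it is present.
    if result.qualitative is not None:
        return result.qualitative.strip().lower()
    # Without both a value and a threshold, no call can be derived.
    if result.value is None or result.threshold is None:
        return "unknown"
    # Assumption: values at or above the platform threshold are "detected".
    return "detected" if result.value >= result.threshold else "not detected"


# Hypothetical results; the LOINC string and the numbers are placeholders.
high = SerologyResult(loinc="XXXXX-X", value=12.5, unit="U/mL", threshold=0.8)
low = SerologyResult(loinc="XXXXX-X", value=0.4, unit="U/mL", threshold=0.8)
calls = [qualitative_call(high), qualitative_call(low)]
```

Numeric values from different platforms are not interchangeable, so a sketch like this harmonizes only the qualitative call and leaves the semi-quantitative values tied to their own units and thresholds.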

To substantially enhance the robustness of information regarding patients with cancer, data from the National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) registry were linked to existing patient records in the CRWDi. SEER includes 22 population-based cancer registries, representing approximately 48% of the US population (https://seer.cancer.gov/). This level of coverage provides for analytic validity and generalization with equitable selection of subjects [34]. Cancer data from diagnosis years 2010–2021 were linked, including data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, first course of treatment, and follow-up vital status. SEER information was thus available regarding cancer diagnosed both before (2010–2019) and during the first 2 calendar years of the pandemic (2020 and 2021), concordant with the 2-year lag time for complete case reporting to the SEER program [35]. The inclusion of patients with incident cancer well prior to the pandemic permits the opportunity to study the associated risks of SARS-CoV-2 infection in cancer survivors, including those with recurrent disease.

The respective strengths and limitations of these data sources are provided in Supplementary Table 2.

CRWDi Inclusion Criteria

The dataset was contractually targeted to be 5.2 million patients, based on the inclusion criteria given in Table 2. The hierarchical development of CRWDi is shown in Figure 1.

Table 2.

Inclusion Criteria for COVID-19 Real-World Data Infrastructure (CRWDi)

| Criterion | Definition |
| HealthVerity COVID-19 Master Set | HealthVerity medical and/or pharmacy claims with ICD-10 coding for COVID-19, and/or |
| | HealthVerity data documenting COVID-19 vaccination, and/or |
| | Laboratory test data (a) for: |
| | SARS-CoV-2 nucleic acid amplification diagnostic testing, and/or |
| | SARS-CoV-2 serologic testing (total IgG and/or anti-S IgG) |
| General inclusion criteria | |
| Benefits enrollment dates | 1 Dec 2018 to 31 Dec 2023 |
| Claims data | Medical and pharmacy closed claims |
| Enrollment duration | ≥180 d continuous health plan enrollment within the period 1 Dec 2018 to 31 Dec 2023, on the basis of closed claims; open claims data made available for analysis of patients meeting the ≥180 d inclusion criterion for closed claims |
| Age | 0–89 y; patients >89 y coded as 90 y of age |
| Hierarchically applied criteria | |
| Cancer patients | All HealthVerity COVID-19 Master Set patients meeting above inclusion criteria, who overlapped with the NCI SEER registry (inclusion dates 1 Jan 2010 to 31 Dec 2021); then |
| Rheumatic disease patients (b) | Remaining COVID-19 Master Set patients with ICD-10 coding for rheumatic disease conditions, and with closed claims coding for active pharmaceutic treatment for rheumatic disease; then |
| Transplant patients (c) | Remaining COVID-19 Master Set patients with ICD-10 coding for solid organ transplant or hematopoietic stem cell transplant; then |
| General population (d) | Remaining COVID-19 Master Set patients who had laboratory SARS-CoV-2 serologic testing data; pediatric patients (aged <18 y) sampled to limit to the % pediatric population in the cancer, rheumatic, and transplant cohorts (10%). As the number of both general population adult and pediatric patients with serologic testing data did not complete the planned total CRWDi population of 5.2 million, remaining general population sampled to bring general population cohort into accordance with the age and sex distribution of 2020 US census, again limiting pediatric contribution to 10%. |
| Northwell Health criteria | |
| SARS-CoV-2 laboratory data linked to existing COVID-19 Master Set patients, or to expand the COVID-19 Master Set | Laboratory testing data: |
| | SARS-CoV-2 nucleic acid amplification diagnostic testing |
| | SARS-CoV-2 serologic testing (total IgG; Apr 2020 to Apr 2021) |
| | SARS-CoV-2 anti-S serologic testing (IgG; Apr 2021 to Apr 2023) |
| | SARS-CoV-2 anti-N serologic testing (IgG; Apr 2021 to Apr 2023) |

Enrollment criteria were applied sequentially. All de-identified patient records were drawn from HealthVerity Marketplace (HVM); data received from commercial laboratories, SEER registry, and Northwell Health were harmonized with HVM patient records.

Abbreviations: COVID-19, coronavirus disease 2019; CRWDi, COVID-19 Real World Data infrastructure; ICD-10, International Classification of Diseases, Tenth Revision; IgG, immunoglobulin G; NCI, National Cancer Institute; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; SEER, Surveillance, Epidemiology, and End Results.

(a) As available from: Laboratory Corporation of America Holdings (Burlington, North Carolina) and Quest Diagnostics Inc (Secaucus, New Jersey).

(b) Mutually exclusive relative to cancer patients; does not include rheumatic patients captured as part of cancer patient cohort.

(c) Mutually exclusive relative to cancer and rheumatic patients; does not include transplant patients captured as part of cancer or rheumatic patient cohorts.

(d) Mutually exclusive relative to cancer, rheumatic disease, and transplant patients.

Figure 1.


Waterfall chart of COVID-19 Real World Data infrastructure (CRWDi) cohort selection. Starting at the upper left, the sequential application of inclusion criteria presented in Table 2 yielded the population numbers shown in each box. The left column shows the extract for Surveillance, Epidemiology, and End Results (SEER) registry–informed cancer patients. The second column is the mutually exclusive remainder extracted for “active” rheumatic conditions (on medication during the minimum 180-d enrollment period). All patients with active rheumatic conditions other than rheumatoid arthritis or spondyloarthritis were included. Owing to their large numbers, patients having active rheumatoid arthritis or spondyloarthritis were sampled for inclusion in CRWDi. Since International Classification of Diseases, Tenth Revision coding might include patients with rheumatic disease coding in both subgroups, the actual number of unique rheumatic patients is as given in the “Active Rheumatic Disease” box; the number of patients having diagnostic codes in both “Other Rheumatic” and “RA and spondyloarthritis” categories is shown at bottom. The third column is the mutually exclusive remainder extracted for solid organ or hematopoietic stem cell transplantation; the transplant subgroups were mutually exclusive. The fourth column is the remainder following hierarchical selection of the first 3 patient groups, constituting a general population. The sampling of this population to obtain the numbers of unique adult and pediatric patients is given in the Materials and Methods. *HealthVerity COVID-19 Master Set for patients with benefits enrollment from 1 Dec 2018 to 31 Dec 2023. Criteria for inclusion are given in Table 2. **SEER registry data for patients with cancer diagnoses in calendar years 2010 to 2021.
Abbreviations: COVID-19, coronavirus disease 2019; HV, HealthVerity; NCI SEER, National Cancer Institute Surveillance, Epidemiology, and End Results registry; RA, rheumatoid arthritis; Rheum, rheumatic.

HealthVerity COVID-19 Master Set

Patient records informed by claims data with International Classification of Diseases, Tenth Revision (ICD-10) coding for COVID-19, documented vaccination for COVID-19, and/or SARS-CoV-2 laboratory testing data from 2 major commercial laboratories qualified a patient for inclusion in the HealthVerity COVID-19 Master Set (https://healthverity.com/mastersets/COVID-19). Claims-based ICD-10 coding for COVID-19 or COVID-19–related conditions began on 1 April 2020 when the COVID-19 diagnosis code U07.1 was implemented [36]. This code has a positive predictive value of >90% for hospitalized patients and 70%–77% for outpatients [37–39]. In turn, the COVID-19 Master Set has extensive information on US COVID-19 vaccination, because it draws upon both administrative pharmacy claims data for vaccination and, for a subset of states, public health records, the latter set up as a part of the US response to the COVID-19 pandemic [40]. At the time of CRWDi development, the COVID-19 Master Set contained 257 948 041 patient records, which were the source records for application of the following specific CRWDi inclusion criteria.

CRWDi General Inclusion Criteria

First, the patients must have had preexisting de-identified benefits enrollment records in the HealthVerity COVID-19 Master Set for the time interval 1 December 2018 to 31 December 2023. The starting date of 1 December 2018 was selected to obtain healthcare information predating COVID-19, thereby providing valuable longitudinal information (when available) on preexisting immunocompromising and comorbid conditions and on baseline health resource utilization.

Second, patients needed to be enrolled in a health plan with both medical and pharmacy benefits, with closed claims data available. Third, patients needed to have been enrolled in a health plan for at least 180 days during the time interval 1 December 2018 to 31 December 2023. The 180 days’ enrollment was chosen to increase the likelihood of having robust longitudinal claims data. The fact that patients with 2018 and 2019 claims data had COVID-19–relevant information (diagnostic codes, laboratory test data, and/or vaccination data) qualifying them for inclusion in the COVID-19 Master Set indicates that those patients had continuous benefits enrollment extending into the time frame of the COVID-19 pandemic. Indeed, analysis of the SEER cancer cohort of patients showed that their mean enrollment period was 1543 days, the majority of the 1856 days between 1 December 2018 and 31 December 2023 (data not shown), supporting the premise of substantive continuous enrollment during the 5-year CRWDi time frame. Last, although patients of all ages (in years) were to be included, those >89 years of age were coded as age 90 years for patient privacy.
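The enrollment and age rules above can be expressed as a short check. This is a sketch under stated assumptions: enrollment is represented as hypothetical (start, end) date pairs with both ends inclusive, spans that overlap or abut are treated as one continuous run, and the function names are illustrative rather than part of CRWDi.

```python
from datetime import date, timedelta

WINDOW_START = date(2018, 12, 1)
WINDOW_END = date(2023, 12, 31)
MIN_CONTINUOUS_DAYS = 180


def longest_continuous_enrollment(spans):
    """Longest run of consecutive enrolled days inside the study window.

    `spans` is an iterable of (start, end) dates, both inclusive.
    """
    # Clip each span to the study window and drop spans that fall outside it.
    clipped = sorted(
        (max(start, WINDOW_START), min(end, WINDOW_END))
        for start, end in spans
        if start <= end and start <= WINDOW_END and end >= WINDOW_START
    )
    best = 0
    run_start = run_end = None
    for start, end in clipped:
        if run_end is not None and start <= run_end + timedelta(days=1):
            run_end = max(run_end, end)       # overlapping or abutting: extend the run
        else:
            run_start, run_end = start, end   # gap in coverage: start a new run
        best = max(best, (run_end - run_start).days + 1)
    return best


def meets_enrollment_criterion(spans):
    """True when at least 180 continuous enrolled days fall inside the window."""
    return longest_continuous_enrollment(spans) >= MIN_CONTINUOUS_DAYS


def top_code_age(age_years):
    """Ages above 89 years are recorded as 90, as described for patient privacy."""
    return 90 if age_years > 89 else age_years


# Two abutting hypothetical spans form one run; a one-month gap would break it.
abutting = [(date(2019, 1, 1), date(2019, 3, 31)), (date(2019, 4, 1), date(2019, 7, 15))]
eligible = meets_enrollment_criterion(abutting)
```

Treating abutting spans as continuous matters in practice, because plan renewals often appear as separate back-to-back enrollment records.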

Given that CRWDi intake of SARS-CoV-2 laboratory testing data included patients testing negative by NAAT, serology, or both, and intake of patients who had been vaccinated for COVID-19 but were otherwise not diagnosed with COVID-19, CRWDi contains patients who are negative for SARS-CoV-2 infection or COVID-19 illness, as well as those who are positive. These SARS-CoV-2/COVID-19–negative patients serve as a valuable comparison group for studies of patients who were infected with SARS-CoV-2 and/or coded for COVID-19 illness. We note that with the advent of home antigen testing, patients with mild SARS-CoV-2 infection who did not seek medical care may be missed. However, since the scientific focus for CRWDi is to assess risk for medically diagnosed severe disease and/or post-COVID-19 sequelae in immunocompromised populations, it is likely that the CRWDi will have a high diagnostic capture rate for this important subset of the SARS-CoV-2–infected population. This includes a higher likelihood that their SARS-CoV-2 tests were physician-ordered and hence performed by a licensed clinical laboratory.

The above inclusion criteria for CRWDi were applied to all patients. The remaining inclusion criteria were applied sequentially and hierarchically.
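Sequential, hierarchical application means that each patient is placed in the first cohort whose criterion is met, in the order cancer, rheumatic disease, transplant, general population. A minimal sketch of that precedence rule follows; the boolean flags and the function name `assign_cohort` are hypothetical, and the subsequent sampling of the rheumatic and general population cohorts is not modeled here.

```python
from collections import Counter


def assign_cohort(in_seer: bool, active_rheumatic: bool, transplant_coded: bool) -> str:
    """Place a patient in the first matching cohort, in hierarchical order."""
    if in_seer:
        return "cancer"
    if active_rheumatic:
        return "rheumatic"
    if transplant_coded:
        return "transplant"
    return "general"


# Hypothetical patients as (in_seer, active_rheumatic, transplant_coded) flags.
patients = [
    (True, True, False),    # cancer plus rheumatic disease: counted once, as cancer
    (False, True, True),    # rheumatic disease plus transplant: counted as rheumatic
    (False, False, True),   # transplant only
    (False, False, False),  # none of the above: general population
]
cohort_counts = Counter(assign_cohort(*flags) for flags in patients)
```

Under this rule the cohorts are mutually exclusive by assignment, while a patient's other conditions remain visible in the underlying claims, which is why overlaps between conditions can still be analyzed.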

Cancer Cohort

The cancer cohort was defined as all SEER patients with cancer of any type, incident during calendar years 2010 to 2021, who linked with COVID-19 Master Set records meeting general inclusion criteria. The number of such patients loaded into CRWDi was 1 294 022.

Rheumatic Disease Cohort

Drawn from COVID-19 Master Set records meeting general inclusion criteria and after extraction of the cancer cohort, the cohort of active rheumatic disease patients had (1) a rheumatic disease coded at least once in medical and/or pharmacy claims; and (2) received treatment for a rheumatic disease during the required 180-day enrollment period. The ICD-10, Clinical Modification (ICD-10-CM) codes for rheumatic conditions were as reported by D'Silva et al [41] and Qian et al [42]. Initial analysis of the CRWDi dataset identified 3 270 257 unique patients with diagnostic codes for rheumatic conditions, of whom 2 236 041 had received treatment during a 180-day enrollment period and were thus active. The hierarchical uploading of SEER cancer patients (see above) had already included 63 060 patients with active rheumatic disease. Of the remaining patients in the COVID-19 Master Set with active rheumatic disease (n = 2 172 981), we uploaded all patients (n = 1 108 111) with active rheumatic conditions other than rheumatoid arthritis and spondyloarthritis. As these latter 2 conditions were the most common active rheumatic diseases among noncancer rheumatic patients, we sampled these 2 conditions to bring the total number of active rheumatic disease patients to 1.7 million. The total number was selected so as to leave room for the inclusion of 2.0 million general population patients in CRWDi.
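The counts in this paragraph can be checked with a few lines of arithmetic. The reported inputs come from the text; the sample size derived at the end is only an approximation, because it assumes that the 1.7 million target is exact and that patients with codes in both rheumatic categories are counted within the 1 108 111 uploaded in full. Neither assumption is confirmed by the article, and the name `ra_spa_sample_size` is illustrative.

```python
# Counts reported in the text above.
coded_rheumatic = 3_270_257           # any rheumatic diagnostic code
active_rheumatic = 2_236_041          # also treated during a 180-day enrollment period
active_already_in_cancer = 63_060     # captured earlier in the SEER cancer cohort
other_rheumatic_uploaded = 1_108_111  # active, other than RA or spondyloarthritis

# Active rheumatic patients still available after the cancer cohort was extracted.
remaining_active = active_rheumatic - active_already_in_cancer

# Assumed pool of patients whose only active conditions are RA or spondyloarthritis.
ra_spa_pool = remaining_active - other_rheumatic_uploaded


def ra_spa_sample_size(target_total: int) -> int:
    """Patients to sample from the RA/spondyloarthritis pool to reach target_total."""
    needed = target_total - other_rheumatic_uploaded
    if not 0 <= needed <= ra_spa_pool:
        raise ValueError("target cannot be reached from the RA/spondyloarthritis pool")
    return needed


# Illustrative only: treats the rounded "1.7 million" target as an exact count.
approx_sample = ra_spa_sample_size(1_700_000)
approx_sampling_fraction = approx_sample / ra_spa_pool
```

The first subtraction reproduces the reported n = 2 172 981. Under the stated assumptions, a little over half of the rheumatoid arthritis and spondyloarthritis pool would be sampled.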

Transplant Cohort

Drawn from COVID-19 Master Set records meeting the general inclusion criteria and after extraction of the cancer and active rheumatic disease cohorts, the cohort of transplant patients had ICD-10-CM and/or ICD-10, Procedure Coding System (ICD-10-PCS) coding for solid organ transplant (SOT) or hematopoietic stem cell transplant (HSCT). All such patients were imported based on the premise that transplantation of any type would generate patient cohorts of interest. Since 57 023 patients coded for transplantation (SOT or HSCT) had already been included in the SEER cancer cohort and/or rheumatic disease cohort, 279 969 patients remained who had coding for transplantation. These patients were added to CRWDi.

General Population Cohort

The remaining COVID-19 Master Set patients who met the general inclusion criteria were considered a general population, to serve as a source for comparison groups for specific research studies. The 3 cohorts previously selected hierarchically (cancer, rheumatic disease, transplant) had an aggregate age distribution of approximately 90% adult (aged ≥18 years) and 10% pediatric (aged <18 years), which differs from the 78% adult and 22% pediatric population of the US reported by the 2020 census [43]. To optimize the ability of researchers to identify general population subcohorts of ages appropriate to their studies of immunocompromised patients, the general population selected for CRWDi was also targeted to be 90% adult and 10% pediatric patients. Since a key goal of creating CRWDi was to obtain real-world evidence on the potential value of SARS-CoV-2 serologic testing, the first extraction of general population patients was those who had SARS-CoV-2 serologic test results. As this first extract (1 602 161 adult patients; 173 462 pediatric patients) did not fully meet the target general population cohort size of 2.0 million patients, a further sample was obtained of general population patients meeting the general inclusion criteria but who did not have SARS-CoV-2 serologic test results, to bring the total general population number to 2.0 million.
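The size of the further sample follows from the targets stated above. The sketch below computes how many additional adult and pediatric patients would be needed, assuming that the 2.0 million target is an exact count, that the 90%/10% split is applied exactly, and that the whole first extract is retained. These are illustrative assumptions, the function name `top_up_counts` is hypothetical, and the age and sex balancing against the 2020 census noted in Table 2 is not modeled.

```python
def top_up_counts(target_total, pediatric_percent, adults_have, pediatric_have):
    """Additional adult and pediatric patients needed to reach the target mix."""
    pediatric_target = target_total * pediatric_percent // 100
    adult_target = target_total - pediatric_target
    return {
        "adult": max(0, adult_target - adults_have),
        "pediatric": max(0, pediatric_target - pediatric_have),
    }


# First extract (patients with serologic test results), as reported in the text.
first_extract_adult = 1_602_161
first_extract_pediatric = 173_462

needed = top_up_counts(
    target_total=2_000_000,   # assumption: "2.0 million" treated as exact
    pediatric_percent=10,     # target pediatric share from the text
    adults_have=first_extract_adult,
    pediatric_have=first_extract_pediatric,
)
```

Integer percentages are used so that the targets are computed without floating-point rounding.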

Northwell Health Data

Northwell was chosen as a contributing health system, both for having SARS-CoV-2 virologic NAATs from the start of the COVID-19 pandemic in the greater New York metropolitan area (beginning 7 March 2020 [44]) and for having implemented serologic testing for COVID-19 on 20 April 2020. In the latter instance, although the CDC did not recommend SARS-CoV-2 serologic testing in the clinical practice setting [45], Northwell had successfully utilized such testing in evaluation of the healthcare workforce, demonstrating that through the height of the April–June 2020 crisis, frontline healthcare workers did not have increased risk for SARS-CoV-2 seroconversion compared to their respective geographic communities [46]. Physician-ordered SARS-CoV-2 serologic testing was not actively discouraged within the Northwell Health system, so substantial volumes of such testing accrued. This real-world experience from an integrated health system would hopefully enhance the ability of CRWDi to inform development of clinical guidelines for use of SARS-CoV-2 serologic testing in immunocompromised patients. Beyond that, Northwell’s contribution of SARS-CoV-2 laboratory testing data for patients not otherwise informed by administrative COVID-19 data from HealthVerity and/or commercial laboratory data made such Northwell patients eligible for inclusion in CRWDi. The goal was thus for Northwell to augment CRWDi SARS-CoV-2 testing data obtained from the commercial laboratory sources, both diagnostic (NAAT) and serologic. For Northwell patients with SARS-CoV-2 NAAT and/or serologic testing data whose de-identified records were not already present in the COVID-19 Master Set, the general inclusion criterion of 180 days’ continuous enrollment in a benefits plan based on closed claim records was not applied, constituting the only exception to the 180-day rule.

A total of 336 545 Northwell patients linked with HealthVerity patient records. Of these, 40 654 patients were already present in the 5.2 million CRWDi patient population; Northwell SARS-CoV-2 diagnostic and serologic testing data thus supplemented these patient records. However, the other 295 891 Northwell patients became eligible for addition to CRWDi (not shown in Figure 1 and Figure 2), based on Northwell SARS-CoV-2 laboratory data informing COVID-19 status. These additional Northwell patients are available for study.

Figure 2.


Venn diagram of COVID-19 Real World Data infrastructure cohorts. Following the hierarchical cohort selection process shown in Figure 1, the final patient distributions among the 3 cohorts of immunocompromised patients are shown, illustrating the overlap of patients with >1 condition. The mutually exclusive adult and pediatric population cohorts also are shown. Circle area is proportional to patient cohort size. The number of de-identified patients totals exactly 5 200 000. Abbreviations: Rheum, rheumatic; SEER, National Cancer Institute Surveillance, Epidemiology, and End Results registry.

Data Availability for Researchers

This data resource was developed to support academic, noncommercial research projects in the US. Submitted proposals are reviewed by the NCI for appropriateness of the proposal to the data resource (https://seer.cancer.gov/data-software/crwdi/). Upon approval, obtaining access to CRWDi requires both NCI-authorized access to the SEER registry, and HealthVerity-authorized access to the cloud-based cohort discovery tool and analytic platform housing the CRWDi data.

The CRWDi analytics environment leverages the Databricks application (Databricks, Inc, San Francisco, California) to provide researchers with a robust analytics layer. The CRWDi analytics platform capabilities include built-in support for multiple programming languages that are widely used in data analytics and machine learning, including Python, R, SQL, and Scala. Owing to privacy concerns, the cloud-based analytics environment precludes download of individual patient data but permits download of aggregated results such as tables and figures.
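Since only aggregated results can leave the platform, analyses typically end by collapsing patient-level rows into counts. The following sketch shows that final step in plain Python. The row layout and the column names `cohort` and `hospitalized` are hypothetical and do not reflect the actual CRWDi schema or any platform-specific export rule.

```python
from collections import Counter


def aggregate_counts(rows, keys):
    """Collapse patient-level rows into one count per combination of `keys`."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    # Each output row carries the grouping values plus a count, and no patient identifiers.
    return [dict(zip(keys, combo), n=n) for combo, n in sorted(counts.items())]


# Hypothetical de-identified patient-level rows.
rows = [
    {"patient_token": "t1", "cohort": "cancer", "hospitalized": "no"},
    {"patient_token": "t2", "cohort": "cancer", "hospitalized": "yes"},
    {"patient_token": "t3", "cohort": "cancer", "hospitalized": "no"},
    {"patient_token": "t4", "cohort": "transplant", "hospitalized": "yes"},
]
summary = aggregate_counts(rows, keys=("cohort", "hospitalized"))
```

The resulting `summary` contains one row per cohort and outcome combination, which is the kind of table the text describes as downloadable.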

Patient Consent Statement

The institutional review board (IRB) of Northwell approved the contribution of Northwell data to CRWDi, under an exempt protocol and without requirement for patient consent. For proposed studies using CRWDi, each research investigator must obtain and show proof of IRB approval before the NCI will authorize HealthVerity to grant access to CRWDi.

RESULTS

The objective of this study was to create an integrated, harmonized national data resource for study of the impact of COVID-19 on immunocompromised patient populations, with a general population available for comparison purposes. The CRWDi patient cohorts shown in the gray boxes of Figure 1 total exactly 5 200 000 unique patients. Owing to the hierarchical application of selection criteria (see Materials and Methods), there is overlap between the first 3 cohorts. These overlapping relationships are shown in Figure 2.

The outcomes of this study are given in Table 3, which highlights the project aims and outcomes and identifies some of the key strengths of the infrastructure. For researchers seeking to identify specific subcohorts of patients for study, including patients who might have cancer and/or a rheumatologic condition and/or be transplant recipients (SOT or HSCT), the detailed structured data shown in Table 2 are available for analytic purposes, as is the rich dataset from the SEER registry to inform the cancer subset of patients.

Table 3.

COVID-19 Real World Data Infrastructure (CRWDi): Work Accomplished and Future Opportunities

Project Aims Outcomes Key Infrastructure Strengths
1. Assess the ability to deploy a real-world data system to rapidly address critical issues during a public health crisis
  • Integrated targeted data into a unified infrastructure, including large populations of high-risk patients typically not included in clinical trials

  • Widely divergent data sources linked using PPRL technology to create a comprehensive data infrastructure

  • From first data pull and with just-in-time administrative claims update, deployed the integrated system in 5 mo

  • Administrative medical and pharmacy claims data, harmonized with:

    • COVID-19 vaccination data from pharmacy claims and from public health registries

    • Cancer registries

    • SARS-CoV-2 laboratory test data from commercial and nonprofit sources

  • Allows for longitudinal analysis of patient healthcare over full geography of delivery settings

  • Allows exploration of novel questions and issues as a public health emergency evolves over time

2. Develop a system with broad generalizability that can be replicated for future public health emergencies
  • Initial focus on immunocompromised patients with COVID-19 as a successful use case

  • Subsequent focus on longitudinal occurrence/recurrence of cancer and on risk of comorbidities following COVID-19

  • Current CRWDi infrastructure includes up to 4 y of longitudinal data on >5 million unique individuals

  • Ongoing updates will provide an additional 2 y of follow-up

3. Provide a data and analytic resource to generate a targeted set of prescreened patients “preparatory to research”
  • Data freely available for use by US-based nonprofit research groups

  • Enables widely varied cohort creation from subgroups with immunocompromising conditions

  • General population available for control comparison

  • Capability to work in conjunction with the development of clinical trials:

    • For identification of potential eligible patient groups

    • To assess generalizability of clinical trial findings

  • This large sample with ongoing longitudinal follow-up will support prescreening and identification of populations at high risk for adverse outcomes

4. Develop methods to ensure system is available to the broadest set of researchers for optimal utility
  • Established the cloud analytic infrastructure (including analytic support) accessible to researchers

  • Established review and approval processes to enable expedited access by the research community

  • Currently 12 research projects with 10 research groups beyond NCI in process and using CRWDi

  • Opportunity for creation of coalition of CRWDi users

Abbreviations: COVID-19, coronavirus disease 2019; CRWDi, COVID-19 Real World Data infrastructure; NCI, National Cancer Institute; PPRL, privacy-preserving record linkage; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2.

The current SEER registry data in CRWDi extend through 31 December 2021. Therefore, newer incident cancer cases would not be informed by SEER data; such considerations would become evident during the analytic phase of any research project. Likewise, mortality data for SEER-matched patients are available only through 31 December 2021 in the current iteration of CRWDi. However, subsequent linkages will contain mortality data on all CRWDi patients, via a linkage with Veritas Data Research (Claymont, Delaware).

The linkage of these diverse data sources provides an extensive set of information on each patient, including details regarding ambulatory pharmacy usage such as oral antineoplastic therapies for cancer patients, not usually available through health system–based electronic health records. Although other immunocompromising conditions were not the focus of establishing this data infrastructure, such other conditions can be studied on the basis of their ICD-10 coding in the CRWDi dataset.

Creation of this publicly available dataset enables study of questions relevant to the COVID-19 pandemic, not readily answerable through other data sources. Structured data that can inform those questions include dates associated with all diagnostic codes, procedural codes, pharmacy codes, laboratory tests, and vaccinations; hospitalization dates of admission and discharge; intensive care unit admissions; comorbid conditions; and, for patients linked to SEER registry data, detailed information on their cancers including treatment and their survival. The CRWDi infrastructure thus offers substantive opportunity for assessing longitudinal patient healthcare needs and outcomes in relationship to their COVID-19 status.

DISCUSSION

The CRWDi infrastructure was developed by linkage of multiple real-world data sources of patients in the US, and is now available to the national research community (https://seer.cancer.gov/data-software/crwdi). Collectively, CRWDi constitutes a nonrandom sample of real-world data informed by COVID-19 status, on a substantive population of immunocompromised patients, with a comparator general population. The fact that the general inclusion criteria were for de-identified patients who had COVID-19–related data, including negative as well as positive laboratory test results for COVID-19 (NAAT and/or serologic), means that research studies can be informed by patients who did experience SARS-CoV-2 infection and COVID-19 illness, in comparison to those who did not. It must be emphasized that this report describes the aims and outcomes of the initial project—to develop and make accessible the CRWDi. The infrastructure now provides researchers the opportunity to develop their own inclusion criteria, analytic cohorts, and research aims for a broad set of unanswered questions pertaining to the COVID-19 pandemic.

Having been established, this data resource also can be updated. Further, by modifying the inclusion criteria of an already established data infrastructure, the CRWDi approach to dataset development can be replicated relatively quickly in the event of a different health crisis or to address other important clinical or public health questions. Importantly, the use of PPRL for the consolidation of the multiple data sources [32] represents a relatively new but effective method to bring together a broad set of diverse data sources that only together can answer questions not possible using a single data source, however large.

While real-world data cannot replace clinical trials, the ability to leverage real-world datasets allows investigators and governmental agencies to focus on important subpopulations who are unlikely to be eligible for clinical trials. This resource may also be useful to validate observations obtained from other large, national consortia, particularly those based solely on electronic medical records [1–3]. A strong advantage of CRWDi is identification through administrative claims of longitudinal data on patients who have received care from multiple health systems and who have obtained and filled pharmacy prescriptions in the ambulatory setting. Indeed, information on ambulatory pharmaceutical treatment for both COVID-19 and coexistent conditions such as cancer or autoimmune disease provides a powerful enhancement for understanding the impact of COVID-19 in these higher-risk patients [47]. Although the claims-based foundation of CRWDi omits patients who do not have health insurance, this dataset is otherwise drawn from the entire US population, and so should yield findings that are generalizable to the insured US population.

Specific strengths of the administrative data from the HealthVerity COVID-19 Master Set brought into CRWDi include capture of comprehensive healthcare utilization, including diagnostic and treatment information, and detailed vaccine administration data. The large dataset of SARS-CoV-2 diagnostic and, to a lesser extent, serologic testing from the 2 largest US commercial clinical laboratories provides extensive information from the ambulatory practice setting. In addition, unlike clinical trials, which apply exclusion criteria, CRWDi includes diverse populations without medical-status exclusion criteria, complementing trial data on COVID-19 vaccination [5, 6, 34, 48, 49].

The SEER registry provides high-quality adjudicated data on patients with incident cancer followed longitudinally from 2010 to 2021 with accurate long-term survival information, which is not available through administrative claims data. Two existing limitations of the NCI SEER registry are that the data extend through 31 December 2021 only, and that SEER data may be incomplete with regard to subsequent courses of therapy, disease progression, or recurrence [35]. Through CRWDi, linkage of SEER data to the claims data of HealthVerity provides access to extensive information about the longitudinal clinical course of SEER cancer patients through 31 December 2023, thus substantively extending both the chronologic reach and the comprehensive detail on treatments and outcomes of patients in the SEER registry database [35], including potential recurrence of cancer.

Moreover, the integration into CRWDi of novel data sources such as the Northwell data provides important proof-of-concept that health system–based electronic health record data can be linked to this national data resource. Notably, by providing a large additional sample of SARS-CoV-2 serologic test results, the Northwell data provide a critical opportunity for validation of the potential value of SARS-CoV-2 serologic testing for patients of an integrated health system. In addition, the Northwell data expanded the reach of the HealthVerity COVID-19 Master Set by an additional 295 891 patients whose SARS-CoV-2 laboratory data could be linked to HealthVerity Marketplace. Both of these points underscore the importance of including data from health systems in development of a harmonized national data infrastructure for study of public health emergencies.

There are limitations of CRWDi, summarized here, with more detail provided in the Supplementary Materials. First, administrative coding for medical diagnoses (ICD-10-CM), procedures (ICD-10-PCS, Current Procedural Terminology, Healthcare Common Procedure Coding System), and pharmacy (National Drug Code), as well as reporting of cancer diagnoses (SEER), is subject to potential coding inaccuracies [50–52] or incompleteness [42, 53, 54]. Second, the requirement for 180 days’ enrollment for the study period of 1 December 2018 to 31 December 2023 may limit generalizability to individuals without insurance enrollment [55] or may have introduced selection bias against previously enrolled patients who contracted COVID-19 early in the pandemic and died in <180 days but were outside their enrollment window. Third, the advent of home antigen testing raises the possibility of bias against patients with mild SARS-CoV-2 infection who remained in the ambulatory setting and did not seek medical attention. However, since the scientific focus for CRWDi is to assess risk for medically diagnosed severe disease and/or post-COVID-19 sequelae in immunocompromised populations, it is likely that CRWDi will have a high capture rate for this important subset of the SARS-CoV-2–infected population. This includes a higher likelihood that their SARS-CoV-2 laboratory tests were physician-ordered and hence performed by a licensed clinical laboratory.

Fourth, since the SEER registry represents approximately half of cancer patients in the US, it is likely that some rheumatic and transplant patients, and patients in the general population, may be patients with cancer who are not in the SEER registry. The possibility of cancer diagnoses in the administrative claims record of non-SEER patients can be assessed during the analytic phase of research studies. Fifth, identification of patients with rheumatic disease for inclusion in CRWDi was on the basis of 1 relevant ICD-10 diagnostic code being present in the patient record. Researchers performing analytic studies of rheumatic disease patient cohorts should consider implementing validated algorithms for identifying patients, such as requiring repetition of the same rheumatic disease ICD-10 code at least 1 month apart [56].
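A stricter case definition of the kind suggested above, requiring the same rheumatic disease ICD-10 code on 2 dates at least 1 month apart, can be sketched in Python as follows. This is an illustrative sketch and not a validated algorithm: the claim layout, the function name, the example codes, and the approximation of "1 month" as 30 days are all assumptions made here.

```python
from datetime import date

MIN_GAP_DAYS = 30  # assumption: "1 month" approximated as 30 days


def patients_with_repeated_code(claims, min_gap_days=MIN_GAP_DAYS):
    """Return patients who have the same diagnosis code on two service dates
    separated by at least min_gap_days."""
    first_last = {}  # (patient_id, code) -> (earliest date, latest date)
    for patient_id, code, service_date in claims:
        key = (patient_id, code)
        if key in first_last:
            first, last = first_last[key]
            first_last[key] = (min(first, service_date), max(last, service_date))
        else:
            first_last[key] = (service_date, service_date)
    return {
        patient_id
        for (patient_id, _code), (first, last) in first_last.items()
        if (last - first).days >= min_gap_days
    }


# Hypothetical claims: (patient_id, ICD-10-CM code, service date).
claims = [
    ("a", "M05.79", date(2021, 1, 5)),
    ("a", "M05.79", date(2021, 2, 10)),  # same code, 36 days later: qualifies
    ("b", "M32.9", date(2021, 1, 5)),
    ("b", "M32.9", date(2021, 1, 20)),   # same code, only 15 days later
    ("c", "M05.79", date(2021, 1, 5)),
    ("c", "M32.9", date(2021, 3, 1)),    # different codes: does not qualify
]

confirmed = patients_with_repeated_code(claims)
print(sorted(confirmed))
```

Comparing the earliest and latest date per patient and code is sufficient here, because any pair of dates at least 30 days apart implies that the extremes are also at least 30 days apart.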

A sixth issue is the missingness of healthcare data, which affects both administrative claims and the electronic medical records data utilized for other COVID-19 data infrastructures [1–3, 57]. Discontinuity in medical records and missingness of vaccine records have been highlighted as an issue by the National COVID Cohort Collaborative, hindering the study of prior treatment, comorbidities, and outcomes [58, 59]. Administrative claims data as constituted in CRWDi are considered to have an advantage in dealing with data missingness, both for establishing baseline information on prior treatment and comorbidities and identifying care gaps, and for assessing person-time (the total time that individuals in a population are at risk of developing a disease or condition) across the full geography of multiple care settings [60, 61]. The CRWDi requirement for at least 180 days’ continuous benefits enrollment should further enhance the availability of longitudinal patient data. For COVID-19 vaccination, CRWDi draws upon administrative claims data and public health sources well beyond what is available for studies based on electronic medical records, and so should have more complete real-world information for examining vaccine effectiveness.
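Two quantities mentioned in this paragraph, continuous enrollment and person-time, lend themselves to a short worked example. The Python sketch below assumes that enrollment is recorded as inclusive (start, end) date spans and that spans separated by no more than 1 day count as continuous; both conventions, along with the function names and dates, are assumptions made for illustration and are not CRWDi definitions.

```python
from datetime import date, timedelta


def longest_continuous_enrollment(spans):
    """Return the longest run of continuous enrollment in days (inclusive).

    Spans that overlap or abut (next start <= previous end + 1 day) are merged.
    """
    longest = 0
    run_start = run_end = None
    for start, end in sorted(spans):
        if run_end is not None and start <= run_end + timedelta(days=1):
            run_end = max(run_end, end)  # extend the current run
        else:
            run_start, run_end = start, end  # a gap begins a new run
        longest = max(longest, (run_end - run_start).days + 1)
    return longest


def person_days(follow_up):
    """Sum time at risk across patients, given (entry, exit) date pairs."""
    return sum((exit_date - entry_date).days for entry_date, exit_date in follow_up)


# Hypothetical enrollment spans for one patient.
spans = [
    (date(2020, 1, 1), date(2020, 3, 31)),
    (date(2020, 4, 1), date(2020, 6, 30)),  # abuts the first span
    (date(2020, 9, 1), date(2020, 9, 30)),  # follows a 2-month gap
]
run_days = longest_continuous_enrollment(spans)
meets_180_days = run_days >= 180

# Hypothetical follow-up windows for two patients.
total_days = person_days([
    (date(2021, 1, 1), date(2021, 1, 31)),
    (date(2021, 1, 1), date(2021, 3, 2)),
])
print(run_days, meets_180_days, total_days)
```

Dividing the person-day total by 365.25 gives person-years, the usual denominator for incidence rates.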

CONCLUSIONS

The CRWDi infrastructure, consisting of COVID-19–related data based on healthcare claims and other data from US patients, achieved several key objectives for which the system was developed. First, it demonstrated that a complex harmonized real-world data system can be established to address ongoing and new questions arising during a pandemic that are not otherwise addressable by established data consortia. Second, CRWDi enables the study of important population subgroups that are typically underrepresented in prospective cohort studies and clinical trials, but for whom important clinical questions remain unanswered. Third, the data have been made broadly and freely available to academic researchers through an NCI-based proposal review and approval process. Fourth, the development of this infrastructure in the midst of an evolving pandemic crisis demonstrates proof-of-principle that the US can assemble a real-world data infrastructure relevant to a public health emergency on a timely basis, particularly since a paradigm for doing so has now been established. The scope of such an infrastructure is determined by the structured data selected for intake, including that obtained from licensed clinical laboratories and potentially public health laboratories. In summary, the CRWDi real-world data system represents an important complement to existing consortia studies and clinical trials that emerged during the healthcare crisis, can be updated in continuation of its current purpose, and can be reproduced for future purposes.

Contributor Information

James M Crawford, Northwell, Department of Pathology and Laboratory Medicine, New Hyde Park, New York, USA.

Lynne Penberthy, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.

Ligia A Pinto, Vaccine, Immunity and Cancer Directorate, Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA.

Keri N Althoff, Johns Hopkins Bloomberg School of Public Health, Department of Epidemiology, Baltimore, Maryland, USA.

Magdalene M Assimon, Aetion Inc, New York, New York, USA.

Oren Cohen, Laboratory Corporation of America, Burlington, North Carolina, USA.

Laura Gillim, Laboratory Corporation of America, Burlington, North Carolina, USA.

Tracy L Hammonds, HealthVerity, Philadelphia, Pennsylvania, USA.

Shilpa Kapur, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.

Harvey W Kaufman, Quest Diagnostics, Secaucus, New Jersey, USA.

David Kwasny, HealthVerity, Philadelphia, Pennsylvania, USA.

Jean W Liew, Boston University Chobanian and Avedisian School of Medicine, Section of Rheumatology, Boston, Massachusetts, USA.

William A Meyer, III, Quest Diagnostics, Secaucus, New Jersey, USA.

Shannon L Reynolds, Aetion Inc, New York, New York, USA.

Cheryl B Schleicher, Northwell, Department of Pathology and Laboratory Medicine, New Hyde Park, New York, USA.

Suki Subbiah, Louisiana State University Health Sciences Center, Section of Hematology/Oncology, New Orleans, Louisiana, USA.

Catherine Theruviparampil, Aetion Inc, New York, New York, USA.

Zachary S Wallace, Massachusetts General Hospital, Division of Rheumatology, Allergy, and Immunology, Boston, Massachusetts, USA.

Jeremy L Warner, Brown University, Department of Medicine, Providence, Rhode Island, USA; Rhode Island Hospital, Providence, Rhode Island, USA.

Suhyeon Yoon, Integrated Data Sciences Section, Research Technologies Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA.

Yonah C Ziemba, Northwell, Department of Pathology and Laboratory Medicine, New Hyde Park, New York, USA.

Supplementary Data

Supplementary materials are available at Open Forum Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

Notes

Acknowledgments. The authors thank these individuals who also contributed to CRWDi planning discussions: Tegan Boehmer, Commander, Adi Gundlapalli, PhD, and Matthew Ritchie, PhD, of the Centers for Disease Control and Prevention, Atlanta, Georgia. These individuals from HealthVerity, Inc, also are thanked for their contributions: Rick Edwards, Christopher Williams, and Sean Thompson. The members of the CRWDi Subject Matter Expert Working Group were: J. M. C. (co-chair), L. P. (co-chair), L. A. P., K. N. A., M. M. A., O. C., L. G., T. L. H., H. W. K., D. K., J. W. L., W. A. M., S. L. R., S. S., C. T., Z. S. W., J. L. W., and S. Y.

Author contributions. Conception and design: J. M. C., L. P., L. A. P., K. N. A., M. M. A., O. C., L. G., S. K., H. W. K., J. W. L., W. A. M., S. L. R., S. S., C. T., Z. S. W., and J. L. W. Acquisition of data: T. L. H., D. K., C. B. S., and Y. C. Z. Analysis and interpretation of data: J. M. C., L. P., T. L. H., S. Y., and Y. C. Z. Wrote the manuscript: J. M. C. and L. P. All authors reviewed and revised the manuscript, approved the final version, and agree to be accountable for all aspects of the study.

Disclaimer. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organizations imply endorsement by the US government.

Financial support. This project was funded in whole or in part by federal funds from the National Cancer Institute, National Institutes of Health (NIH), under contract number 75N91019D00024.

References

  • 1. Haendel MA, Chute CG, Bennett TD, et al. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. J Am Med Informatics Assoc 2021; 28:427–43.
  • 2. Kulkarni AA, Hennessy C, Wilson G, et al. Brief report: impact of anti-cancer treatments on outcomes of COVID-19 in patients with thoracic cancers: a CCC19 registry analysis. Clin Lung Cancer 2024; 25:e229–37.e7.
  • 3. Yazdany J, Ware A, Wallace ZS, et al. Impact of risk factors on COVID-19 outcomes in unvaccinated people with rheumatic diseases. A comparative analysis of pandemic epochs using the COVID-19 Global Rheumatology Alliance Registry. Arthritis Care Res (Hoboken) 2024; 76:274–87.
  • 4. Bonanni P, Ceddia F, Dawson R. A call to action: current challenges and considerations for COVID-19 vaccination in immunocompromised populations. J Infect Dis 2023; 228(Suppl 1):S70–6.
  • 5. Liatsou E, Ntanasis-Stathopoulos I, Lykos S, et al. Adult patients with cancer have impaired humoral responses to complete and booster COVID-19 vaccination, especially those with hematologic cancer on active treatment: a systematic review and meta-analysis. Cancers (Basel) 2023; 15:2266.
  • 6. Desai A, Gainor JF, Hegde A, et al. COVID-19 vaccine guidance for patients with cancer participating in oncology clinical trials. Nat Rev Clin Oncol 2021; 18:313–9.
  • 7. Rodriguez-Watson CV, Louder AM, Kabelac C, et al. Real-world performance of SARS-CoV-2 serology tests in the United States, 2020. PLoS One 2023; 18:e0279956.
  • 8. Hayden MK, El Mikati IK, Hanson KE, et al. Infectious Diseases Society of America guidelines on the diagnosis of COVID-19: serologic testing [manuscript published online ahead of print 15 March 2024]. Clin Infect Dis 2024. doi: 10.1093/cid/ciae121
  • 9. Bhimraj A, Morgan RL, Shumaker AH, et al. Infectious Diseases Society of America guidelines on the treatment and management of patients with COVID-19 (September 2022). Clin Infect Dis 2024; 78:e250–349.
  • 10. Zhang Z, Mateus J, Coelho C, et al. Humoral and cellular immune memory to four COVID-19 vaccines. Cell 2022; 185:2434–51.e17.
  • 11. Gilboa M, Gonen T, Barda N, et al. Factors associated with protection from SARS-CoV-2 Omicron variant infection and disease among vaccinated health care workers in Israel. JAMA Netw Open 2023; 6:e2341757.
  • 12. Mack PC, Hsu CY, Rodilla AM, et al. Time-dependent effects of clinical interventions on SARS-CoV-2 immunity in patients with lung cancer. Vaccines (Basel) 2024; 12:713.
  • 13. Harvey RA, Rassen JA, Kabelac CA, et al. Association of SARS-CoV-2 seropositivity antibody test with risk of future infection. JAMA Intern Med 2021; 181:672–9.
  • 14. Kaufman HW, Chen Z, Meyer WA III, Wohlgemuth JG. Insights from patterns of SARS-CoV-2 immunoglobulin G serology test results from a national clinical laboratory, United States, March–July 2020. Popul Health Manag 2021; 24(Suppl 1):S35–42.
  • 15. Reynolds SL, Kaufman HW, Meyer WA III, et al. Risk of and duration of protection from SARS-CoV-2 reinfection assessed with real-world data. PLoS One 2023; 18:e0280584.
  • 16. Kaufman HW, Meyer WA III, Clarke NJ, et al. Assessing vulnerability to COVID-19 in high-risk populations: the role of SARS-CoV-2 spike-targeted serology. Popul Health Manag 2023; 26:29–36.
  • 17. Kaufman HW, Letovsky S, Meyer WA III, et al. SARS-CoV-2 spike-protein targeted serology test results and their association with subsequent COVID-19-related outcomes. Front Public Health 2023; 11:1193246.
  • 18. Ferri C, Ursini F, Gragnani L, et al. Impaired immunogenicity to COVID-19 vaccines in autoimmune systemic diseases. High prevalence of non-response in different patients’ subgroups. J Autoimmunity 2021; 125:102744.
  • 19. Song Q, Bates B, Shao YR, et al. Risk and outcome of breakthrough COVID-19 infections in vaccinated patients with cancer: real-world evidence from the National COVID Cohort Collaborative. J Clin Oncol 2022; 40:1414–27.
  • 20. Chenchula S, Karunakaran P, Sharma S, Chavan M. Current evidence on efficacy of COVID-19 booster dose vaccination against the Omicron variant: a systematic review. J Med Virol 2022; 94:2969–76.
  • 21. Bhavsar D, Singh G, Sano K, et al. Mucosal antibody responses to SARS-CoV-2 booster vaccination and breakthrough infection. mBio 2023; 14:e0228023.
  • 22. Kalra P, Ali S, Ocen S. Modelling on COVID-19 control with double and booster-dose vaccination. Gene 2024; 928:148795.
  • 23. Marra AR, Kobayashi T, Callado GY, et al. The effectiveness of COVID-19 vaccine in the prevention of post-COVID conditions: a systematic literature review and meta-analysis of the latest research. Antimicrob Steward Healthc Epidemiol 2023; 3:e168.
  • 24. Zang C, Hou Y, Schenck EJ, et al. Identification of risk factors of long COVID and predictive modeling in the RECOVER EHR cohorts. Commun Med (Lond) 2024; 4:130.
  • 25. Vinson AJ, Schissel M, Anzalone AJ, et al. The prevalence of post-acute sequelae of COVID-19 in solid organ transplant recipients: evaluation of risk in the National COVID Cohort Collaborative (N3C). Am J Transpl 2024; 24:1675–89.
  • 26. Ogarek N, Oboza P, Olszanecka-Glinianowicz M, Kocelak P. SARS-CoV-2 infection as a potential risk factor for the development of cancer. Front Mol Biosci 2023; 10:1260776.
  • 27. Linjawi M, Shakoor H, Hilary S, et al. Cancer patients during COVID-19 pandemic: a mini-review. Healthcare 2023; 11:248.
  • 28. Yang L, Chai P, Yu J, Fan X. Effects of cancer on patients with COVID-19: a systematic review and meta-analysis of 63,019 participants. Cancer Biol Med 2021; 18:298–307.
  • 29. Leston M, Elson W, Ordóñez-Mena JM, et al. Disparities in COVID-19 mortality amongst the immunosuppressed: a systematic review and meta-analysis for enhanced disease surveillance. J Infection 2024; 88:106110.
  • 30. Office of the President of the United States. National strategy to advance privacy-preserving data sharing and analytics. 2023. Available at: https://www.whitehouse.gov/wp-content/uploads/2023/03/National-Strategy-to-Advance-Privacy-Preserving-Data-Sharing-and-Analytics.pdf. Accessed 6 July 2024.
  • 31. Frederick National Laboratory for Cancer Research. Evaluating the performance of privacy preserving record linkage systems. 2023. Available at: https://surveillance.cancer.gov/reports/TO-P2-PPRLS-Evaluation-Report.pdf. Accessed 6 July 2024.
  • 32. Aronson J. Landscape analysis of privacy preserving patient record linkage software (P3RLS). 2020. Available at: https://surveillance.cancer.gov/reports/TO-P1-PPRLS-Landscape-Analysis.pdf. Accessed 11 July 2024.
  • 33. Quest Diagnostics. Quest Diagnostics granted CDC contract to support COVID-19 infection and vaccination seroprevalence research. 2022. Available at: https://newsroom.questdiagnostics.com/2022-03-23-Quest-Diagnostics-Granted-CDC-Contract-to-Support-COVID-19-Infection-and-Vaccination-Seroprevalence-Research. Accessed 4 July 2024.
  • 34. Behring M, Hale K, Ozaydin B, Grizzle WE, Sodeke SO, Manne U. Inclusiveness and ethical considerations for observational, translational, and clinical cancer health disparity research. Cancer 2019; 125:4452–61.
  • 35. Enewold L, Parsons H, Zhao L, et al. Updated overview of the SEER-Medicare data: enhanced content and applications. J Natl Cancer Inst Monogr 2020; 2020:lgz029.
  • 36. Centers for Disease Control and Prevention. New ICD-10-CM code for the 2019 novel coronavirus (COVID-19), April 1, 2020. Available at: https://www.cdc.gov/nchs/data/icd/Announcement-New-ICD-code-for-coronavirus-3-18-2020.pdf. Accessed 5 July 2024.
  • 37. Kluberg SA, Hou L, Dutcher SK, et al. Validation of diagnosis codes to identify hospitalized COVID-19 patients in health care claims data. Pharmacoepidemiol Drug Saf 2022; 31:476–80.
  • 38. Lynch KE, Viernes B, Gatsby E, et al. Positive predictive value of COVID-19 ICD-10 diagnosis codes across calendar time and clinical setting. Clinical Epidemiol 2021; 13:1011–8.
  • 39. Moura CS, Neville A, Liao F, et al. Validity of hospital diagnostic codes to identify SARS-CoV-2 infections in reference to polymerase chain reaction results: a descriptive study. CMAJ Open 2023; 11:E982–7.
  • 40. Scharf LF, Adeniyi K, Augustini E, et al. Monitoring and reporting the US COVID-19 vaccination effort. Vaccine 2024; 42:125495.
  • 41. D'Silva KM, Serling-Boyd N, Wallwork R, et al. Clinical characteristics and outcomes of patients with coronavirus disease 2019 (COVID-19) and rheumatic disease: a comparative cohort study from a US ‘hot spot.’ Ann Rheum Dis 2020; 79:1156–62.
  • 42. Qian G, Wang X, Patel NJ, et al. Outcomes with and without outpatient SARS-CoV-2 treatment for patients with COVID-19 and systemic autoimmune rheumatic diseases: a retrospective cohort study. Lancet Rheumatol 2023; 5:e139–50.
  • 43. US Census Bureau. Census Bureau releases new 2020 census data on age, sex, race, Hispanic origin, households and housing. 2023. Available at: https://www.census.gov/newsroom/press-releases/2023/2020-census-demographic-profile-and-dhc.html. Accessed 5 September 2024.
  • 44. Reichberg SB, Mitra PP, Haghamad A, et al. Rapid emergence of SARS-CoV-2 in the greater New York metropolitan area: geolocation, demographics, positivity rates, and hospitalization for 46,793 persons tested by Northwell Health. Clin Infect Dis 2020; 71:3204–13.
  • 45. US Centers for Disease Control and Prevention. Interim guidelines for COVID-19 antibody testing. 2021. Available at: https://www.cdc.gov/coronavirus/2019-ncov/lab/resources/antibody-tests-guidelines.html. Accessed 10 October 2021.
  • 46. Moscola J, Sembajwe G, Jarrett M, et al. Prevalence of SARS-CoV-2 antibodies in health care personnel in the New York City area. J Am Med Assoc 2020; 324:893–5.
  • 47. Zheng Y, Chen Y, Yu K, et al. Fatal infections among cancer patients: a population-based study in the United States. Infect Dis Ther 2021; 10:871–95.
  • 48. Bertagnolli MM, Anderson B, Quina A, Piantadosi S. The electronic health record as a clinical trials tool: opportunities and challenges. Clinical Trials 2020; 17:237–42.
  • 49. Almufleh A, Joseph J. The time is now: role of pragmatic clinical trials in guiding response to global pandemics. Trials 2021; 22:229.
  • 50. Nattinger AB, Schapira MM, Warren JL, Earle CC. Methodological issues in the use of administrative claims data to study surveillance after cancer treatment. Med Care 2002; 40(8 Suppl IV):69–74.
  • 51. Zhao Z, Zhu B, Anderson J, Fu H, LeNarz L. Resource utilization and healthcare costs for acute coronary syndrome patients with and without diabetes. J Med Econ 2010; 13:748–59.
  • 52. Diaz-Garelli F, Strowd R, Lawson VL, et al. Workflow differences affect data accuracy in oncologic EHRs: a first step toward detangling the diagnosis data babel. JCO Clin Cancer Inform 2020; 4:529–38.
  • 53. Thompson WW, Symum H, Sandul A, et al. Vital signs: hepatitis C treatment among insured adults—United States, 2019–2020. MMWR Morb Mortal Wkly Rep 2022; 71:1011–7.
  • 54. Kompaniyets L, Bull-Otterson L, Boehmer TK, et al. Post-COVID-19 symptoms and conditions among children and adolescents—United States, March 1, 2020–January 31, 2022. MMWR Morb Mortal Wkly Rep 2022; 71:993–9.
  • 55. Bhatt AS, McElrath EE, Claggett BL, et al. Accuracy of ICD-10 diagnostic codes to identify COVID-19 among hospitalized patients. J Gen Intern Med 2021; 36:2532–5.
  • 56. Chung CP, Rohan P, Krishnaswami S, McPheeters ML. A systematic review of validated methods for identifying patients with rheumatoid arthritis using administrative or claims data. Vaccine 2012; 31:K41–61.
  • 57. Jones SE, Bradwell KR, Chan LE, et al. Who is pregnant? Defining real-world data-based pregnancy episodes in the National COVID Cohort Collaborative (N3C). JAMIA Open 2023; 6:ooad067.
  • 58. Hendrix  N, Sidky  H, Sahner  DK; N3C Consortium . Influence of prior SARS-CoV-2 infection on COVID-19 severity: evidence from the National COVID Cohort Collaborative. medRxiv [Preprint]. Posted online 6 August 2024. doi: 10.1101/2023.08.03.23293612 [DOI] [Google Scholar]
  • 59. Sidky  H, Young  JC, Girvin  AT, et al.  Data quality considerations for evaluating COVID-19 treatments using real world data: learnings from the National COVID Cohort Collaborative (N3C). BMC Med Res Methodol  2023; 23:46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Getzen  E, Tan  AL, Brat  G, et al.  Leveraging informative missing data to learn about acute respiratory distress syndrome and mortality in long-term hospitalized COVID-19 patients throughout the years of the pandemic. AMIA Annu Symp Proc  2024; 2023:942–50. [PMC free article] [PubMed] [Google Scholar]
  • 61. Winterstein  AG, Ehrenstein  V, Brown  JS, Stürmer  T, Smith  MY. A road map for peer review of real-world evidence studies on safety and effectiveness of treatments. Diabetes Care  2023; 46:1448–54. [DOI] [PMC free article] [PubMed] [Google Scholar]

Supplementary Materials

ofaf021_Supplementary_Data

Data Availability Statement

This data resource was developed to support academic, noncommercial research projects in the US. The NCI reviews submitted proposals for their appropriateness to the data resource (https://seer.cancer.gov/data-software/crwdi/). Upon approval, access to CRWDi requires both NCI-authorized access to the SEER registry and HealthVerity-authorized access to the cloud-based cohort discovery tool and analytic platform housing the CRWDi data.

The CRWDi analytics environment uses the Databricks application (Databricks, Inc, San Francisco, California) to provide researchers with a robust analytics layer. The platform includes built-in support for programming languages widely used in data analytics and machine learning, including Python, R, SQL, and Scala. Owing to privacy concerns, the cloud-based analytics environment precludes download of individual patient data but permits download of aggregated results such as tables and figures.
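
To illustrate the distinction between individual patient data and aggregated results, the following is a minimal Python sketch of collapsing patient-level rows into a cohort-level summary table. It is not the CRWDi schema or workflow: the field names (`patient_id`, `cohort`, `sars_cov_2_positive`), the function `aggregate_by_cohort`, and the toy records are all hypothetical and made up for illustration, and the sketch assumes one row per patient.

```python
from collections import Counter

# Toy, made-up records for illustration only; field names are hypothetical
# and do not reflect the actual CRWDi data model. One row per patient.
patient_rows = [
    {"patient_id": "p1", "cohort": "cancer", "sars_cov_2_positive": True},
    {"patient_id": "p2", "cohort": "cancer", "sars_cov_2_positive": False},
    {"patient_id": "p3", "cohort": "rheumatic_disease", "sars_cov_2_positive": True},
    {"patient_id": "p4", "cohort": "transplant", "sars_cov_2_positive": False},
    {"patient_id": "p5", "cohort": "general_population", "sars_cov_2_positive": False},
    {"patient_id": "p6", "cohort": "general_population", "sars_cov_2_positive": True},
]


def aggregate_by_cohort(rows):
    """Collapse patient-level rows into one summary row per cohort.

    The result carries only counts, with no patient identifiers.
    """
    totals = Counter()
    positives = Counter()
    for row in rows:
        totals[row["cohort"]] += 1
        if row["sars_cov_2_positive"]:
            positives[row["cohort"]] += 1
    return [
        {
            "cohort": cohort,
            "n_patients": totals[cohort],
            "n_positive": positives[cohort],
        }
        for cohort in sorted(totals)
    ]


summary = aggregate_by_cohort(patient_rows)
for summary_row in summary:
    print(summary_row)
```

Only `summary` resembles the kind of aggregated table described above; `patient_rows` stands in for the patient-level data that would remain inside the analytics environment.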


Articles from Open Forum Infectious Diseases are provided here courtesy of Oxford University Press