2025 Sep 8;34(9):e70202. doi: 10.1002/pds.70202

The Guardian Research Network: A Real‐World Data Source for Pharmacoepidemiologic Research and Regulatory Applications

Andrea McCracken 1, Julien Heidt 2, Elizabeth Eldridge 2, Charlie Hurmiz 1, Nicole Duran 2, Adam Reich 2, Efe Eworuke 2
PMCID: PMC12417101  PMID: 40921624

ABSTRACT

Background

The quality of real‐world data (RWD) directly impacts the value of real‐world evidence (RWE) generated for regulatory decision‐making. Data owners and investigators must be prepared to provide documentation on data quality assessments to regulators when submitting secondary data for regulatory purposes. While robust feasibility is required to justify the relevance of a data source for a specific research question, the reliability of the data, including the chain of custody and data journey prior to reaching the end user, is of equal importance for drawing valid, meaningful conclusions.

Aims

Recently, Castellanos et al. constructed a definition of RWD quality by synthesizing definitions across published guidelines to characterize quality attributes of Flatiron Health RWD. In this paper, the transparent reporting of how data quality attributes (as defined by Castellanos et al.) are met in a single RWD source is replicated for the Guardian Research Network (GRN), a database of aggregated electronic health records (EHRs) collected from a geographically representative consortium of regional community health systems with experienced cancer research programs.

Materials & Methods

We first describe GRN, including the data elements collected, timeliness of data availability, representativeness, and data access considerations. We then provide descriptions of how data reliability (accuracy, traceability, timeliness, completeness) and relevance (availability, sufficiency, representativeness) are ensured and assessed in GRN, including illustrative examples of relevant data quality checks.

Results

Descriptions of GRN’s data quality processes demonstrate structured approaches to ensuring both reliability and relevance, aligned with published guidelines. Illustrative examples highlight the application of specific quality checks and their outcomes for GRN data.

Discussion

These findings illustrate the importance of documenting and communicating data quality attributes for RWD sources intended for regulatory use. Structured, transparent reporting can support more informed feasibility assessments and facilitate regulator confidence in RWE generation.

Conclusion

Continued development of structured approaches to identifying data fit for regulatory use underscores the need for comprehensive information about putative data sources during feasibility to inform decision‐making and study design and to elicit transparent conversations with regulators.

Keywords: data quality assessment, data relevance, data reliability, real‐world data, real‐world evidence, regulatory decision‐making


Summary.

  • The Guardian Research Network (GRN) aggregates electronic health records (EHRs) from a consortium of regional community health systems, providing a comprehensive database for oncology and other therapeutic areas.

  • Regulatory submission of real‐world data (RWD) should include transparent documentation of data lineage and quality assessment to support interpretability and trustworthiness.

  • GRN is mapped across multiple dimensions of reliability (accuracy, traceability, timeliness, completeness) and relevance (availability, sufficiency, representativeness), as defined by Castellanos et al.

  • This exercise demonstrates how standardized quality attributes can be operationalized in a real‐world dataset with illustrative examples of quality assessment practices.

  • The findings reinforce the importance of robust, reproducible data quality assessments to inform study design and facilitate regulatory decision‐making.

1. Background

The quality of real‐world data (RWD) directly impacts the value of real‐world evidence (RWE) generated for regulatory decision‐making [1, 2, 3, 4, 5, 6]. Although robust feasibility is required to justify the relevance of a data source for a specific research question, the reliability of the data, including the chain of custody and data journey prior to reaching the end user (researcher), is of equal importance for drawing valid, meaningful conclusions. Published guidelines articulate varied definitions of RWD quality, and recent publications have attempted to summarize and compare definitions across frameworks [3, 5, 7, 8, 9, 10, 11]. While there is overlap, examples of how RWD quality can be assessed and documented in practice are needed to inform framework operationalization, particularly when the goal is regulatory submission.

In the clinical trial setting, Good Clinical Practice (GCP) attributes for data quality have been referred to by the acronym ALCOA‐CCEA: Attributable, Legible, Contemporaneous, Original, Accurate, Complete, Consistent, Enduring, and Available. ALCOA‐CCEA is operationalized through key study‐related workflows that are standardized, validated, and controlled through structured data collection and management methods. These workflows are tied to GCP responsibilities for the Clinical Investigator, which are reaffirmed in auditable regulatory documents and verifiable through the collection of the original “primary” data referred to as source documentation [12].

In the RWE setting, however, clinical workflows from which RWD is extracted cannot be fully standardized or validated in the same way as clinical trials, creating ambiguity around reliability assessment in RWD sources. If secondary data is used for regulatory decision‐making (e.g., supporting a new approval, fulfilling a post‐marketing requirement), then data sources and investigators must be prepared to provide documentation on data quality assessments at the behest of regulators [6, 13]. This data quality assessment documentation should include the transformation of RWD into RWE, as the methods of RWE generation from RWD can introduce bias equal to that of the data journey if not designed and carried out in a manner upholding minimum quality standards, acknowledging the inherent limitations of how and why the RWD has been generated.

Castellanos et al. constructed a definition of RWD quality by synthesizing definitions across published guidelines and guidance including the European Medicines Agency (EMA), National Institute for Health and Care Excellence (NICE), the US Food and Drug Administration (FDA), the Duke Margolis Health Policy Center, and the Patient Centered Outcomes Research Institute (PCORI) [2]. This definition was used to characterize quality attributes of Flatiron Health RWD [1, 2]. This transparent reporting of how data quality attributes are met in a single RWD source is a valuable exercise, and we seek to replicate this approach in the Guardian Research Network (GRN), a database of centralized and integrated electronic health records (EHR) from across a nationwide consortium of regional community health systems with experienced cancer research programs. This paper aims to illustrate the fundamental requirements for demonstrating data relevance and reliability to regulators by mapping GRN data and quality assessment to dimensions of data quality as defined by Castellanos et al.

2. Overview of GRN RWD

GRN comprises 14 health systems across the US, including 43 cancer centers and 85 hospitals in 15 states (Figure 1). GRN captures more than 5 million oncology patients (including precancer) with over 40 000 new cases each year. GRN also includes approximately 40 million non‐cancer patients, 420 million physician notes, and over 44 000 physicians and specialists. The patient pool is nationally representative, with patients from various demographic profiles. The database contains adult and pediatric populations not limited to a specific disease history, with the racial distribution of the database aligning with the U.S. Census. The database is linked to the Area Deprivation Index (ADI), which demonstrates the representativeness of the database and allows for stratification of data by socioeconomic status [14, 15]. Regarding data access, GRN data are available for licensing under a Master Data License Agreement, which outlines the scientific purposes for utilization.

FIGURE 1.

Geography of health systems contributing to GRN. GRN comprises 14 health systems across the US, including 43 cancer centers and 85 hospitals in 15 states.

GRN was originally established for the purposes of identifying and recruiting patients into targeted oncology clinical trials. The utilization of these data for RWE generation is a unique strength of GRN compared to other RWD sources given the extent of data capture for variables commonly used as eligibility criteria in oncology studies; however, GRN data are increasingly being leveraged across a broader range of therapeutic areas, extending its value beyond oncology to support diverse evidence development needs. In GRN, data are collected across the entire integrated delivery network (IDN), meaning data generated outside of the oncology office, including inpatient and outpatient activity, non‐oncologist interactions, and hospitalizations, are captured. This provides more complete and comprehensive insights into the longitudinal patient journey, which is important for assessing safety and effectiveness in oncology indications, given the multidisciplinary nature of cancer care, as well as other therapeutic areas. IDN data from a single‐platform EHR contribute to the richness of the data by enabling diverse data elements to be collected across various specialties and healthcare locations, while the uniform structure of the EHR, once harmonized within GRN, reduces heterogeneity within the dataset. Metadata within the database add detail informing both the patient journey and provider decision‐making, such as the timing of scheduling an appointment or the amount of time between an ordered and completed procedure.

GRN leverages existing healthcare operations' relationships with contributing sites to go directly to the health systems for data and then aggregates and harmonizes the data on behalf of the sites to promote centralized, integrated access (Figure 2). While GRN's database contains certified copies of the EMR and overall represents comprehensive patient journeys within the health system, there are limitations based on data‐sharing agreements with its member health systems and the source EHR. This includes GRN's inability to acquire external health records (e.g., Epic Care Everywhere) on demand, as could be done at the health‐system level. Another data quality challenge is the transition of health systems to new EHR systems, particularly with regard to how health systems manage legacy data. To mitigate issues with missing legacy data, GRN may exclude patients from a RWD study if the patient's legacy EHR data has not been maintained by the health system.

FIGURE 2.

Sources of data variables in GRN RWD. GRN leverages existing healthcare operations' relationships with contributing sites to go directly to the health systems for data and then aggregates and harmonizes the data on behalf of the sites to promote centralized, integrated access.

GRN supports the conformance of RWD to Clinical Data Interchange Standards Consortium (CDISC) standards. Once a sponsor indicates that a dataset will be submitted to the FDA, GRN will either transform the real‐world dataset to CDISC standards independently or will assist the sponsor in transforming the data. GRN considers the sponsor responsible for selecting the appropriate FDA Data Standard(s) for submission. Using primarily EHR data from U.S.‐based health systems reduces the challenges posed by regional and global differences in standards, terminologies, and exchange formats [16].
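To make the CDISC transformation step concrete, the sketch below shapes a harmonized demographics record into an SDTM DM‐like row. The SDTM variable names (STUDYID, USUBJID, SEX, RACE, ETHNIC) follow published CDISC conventions, but the source field names and mapping logic are hypothetical illustrations, not GRN's actual transformation pipeline.

```python
# Hedged sketch: shaping a harmonized demographics record into an
# SDTM DM-like row for submission. SEX/RACE/ETHNIC variable names follow
# CDISC SDTM conventions; source field names ("patient_id", "sex", etc.)
# are hypothetical and would be defined by the study data dictionary.

def to_sdtm_dm(record: dict, studyid: str) -> dict:
    """Map one source demographics record to an SDTM DM-like dict."""
    return {
        "STUDYID": studyid,
        "DOMAIN": "DM",
        # Unique subject ID is conventionally study ID + subject ID
        "USUBJID": f"{studyid}-{record['patient_id']}",
        "SEX": {"male": "M", "female": "F"}.get(record["sex"], "U"),
        "RACE": record["race"].upper(),
        "ETHNIC": record["ethnicity"].upper(),
    }

row = to_sdtm_dm(
    {"patient_id": "0001", "sex": "female",
     "race": "White", "ethnicity": "Not Hispanic or Latino"},
    studyid="RWE01",
)
print(row["USUBJID"])  # RWE01-0001
```

In practice this mapping would be specified and validated against the FDA Data Standards Catalog version selected by the sponsor.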

3. Data Elements in GRN

GRN captures information from a diverse range of structured and unstructured patient data, including EHR documentation that is sourced from partner health systems and other third‐party referral groups (Table 1). When leveraging GRN data for a real‐world study in oncology or other therapeutic areas, data dictionaries are built bespoke to a specific research question; however, GRN has several features and capabilities that can enhance variable completeness and longitudinal capture.

TABLE 1.

Sample data elements in GRN.

Data source Data category Data summary
Structured EMR data Demographics Sex, race, ethnicity, age, geography, ADI
Vital status Date of death
Diagnosis ICD‐10 codes, dates of diagnosis
Encounters Inpatient and outpatient visits, height, weight, blood pressure
Medications Medication name, dose, refills, supply, dates of prescription and administration
Labs Lab name, result, reference range
Procedures Imaging and surgical procedures performed
Allergies Specific allergy, reaction, severity
Providers Specialty and credential of providers treating the patient
Clinical notes and reports Clinical notes Progress notes from providers, nurse notes, telephone notes
Procedure reports Imaging reports, pathology reports, laboratory reports
NLP extracted data Oncology biomarkers Results of routine oncology biomarker testing, such as HER2, ER, PR, and PD‐L1
Social history Smoking status, smoking pack‐years
Performance status ECOG, KPS
Cancer diagnosis details AJCC staging, TNM staging, grade, histology
Test results Pulmonary function test results (e.g., FEV1, FVC), cardiology test results (e.g., LVEF)
Manual abstraction Clinical outcomes Response to therapy, disease progression, treatment complications, cause of death
Clinical decisions Reasons for discontinuation of therapy, surgical candidacy
Social and family history Education, occupation, family history of disease
Custom fields Unlimited custom data fields from information documented within clinical notes and reports

A major challenge in generating RWE for regulatory submission is the lack of transparency in collecting data from third parties, as sponsors may not have visibility into proprietary algorithms or other data cleaning activities affecting the provenance and traceability of the data [13, 17, 18]. GRN integrates all parts of the patient record, including unstructured data in physician notes, molecular sequencing, radiological images, pathological information, and lab results. These data are collected in near real‐time, thus providing a deep understanding of evolving treatment practices, patterns, and outcomes. Unstructured data, such as text in physician notes, is processed by employing natural language processing (NLP) and clinical data curation by trained clinicians who specialize in extracting relevant information from these sources. NLP approaches are helpful in extracting highly standardized data from clinical notes, such as disease staging, performance scores, and results of some clinical tests. As variability exists in how different providers or systems record different clinical concepts, all NLP algorithms are developed and validated by GRN to maximize accuracy. Manual curation is used to capture essential clinical endpoints, such as disease progression and treatment response rates, which are often sparsely recorded in RWD in a structured format.
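The kind of pattern‐based extraction of highly standardized elements described above can be sketched for ECOG performance status. GRN's production NLP pipelines are proprietary; this simplified, hypothetical example only illustrates the general technique of pulling a standardized value from free‐text notes.

```python
import re

# Illustrative only: a single regex capturing an ECOG performance status
# (0-4) from free-text clinical notes. Real pipelines would also classify
# context (current vs. historical vs. hypothetical) and be validated
# against a human reference standard.
ECOG_PATTERN = re.compile(
    r"\bECOG\s*(?:PS|performance status)?\s*(?:of|:|=)?\s*([0-4])\b",
    re.IGNORECASE,
)

def extract_ecog(note: str):
    """Return the ECOG score (0-4) found in a note, or None if absent."""
    match = ECOG_PATTERN.search(note)
    return int(match.group(1)) if match else None

print(extract_ecog("Patient seen today. ECOG PS 1. Continue therapy."))  # 1
print(extract_ecog("No performance status documented at this visit."))   # None
```

Each extracted value would, as described later for traceability, be stored alongside a snippet of the source text so reviewers can verify the finding.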

FDA and EMA have emphasized the need for more accurate staging and histological details on patient diagnoses to improve the exchangeability of real‐world study populations [3, 4, 6, 7]. To support comparative effectiveness research on targeted therapies, GRN provides access to de‐identified, longitudinal scans and images for assessing disease progression, capturing metastatic sites, and evaluating patient response to treatment (e.g., Response Evaluation Criteria in Solid Tumors (RECIST)). Moreover, for certain RWE studies (e.g., external comparator studies), central review of scans may be considered to standardize clinical assessments performed locally by clinicians/radiologists. In GRN, the availability of tissue samples, which can be linked to prior immunohistochemistry genetic testing and associated clinical data, allows for biomarker discovery, testing companion diagnostics, and identification of previously untested markers. Because specimens are collected as part of the standard of care under a separate and distinct Institutional Review Board (IRB) protocol and consent waiver, DNA and whole‐exome sequencing from paraffin‐embedded, surgical tumor specimens can be securely linked with clinical data such as mutation status, side effects, and treatment response [19]. For secondary research, including the collection of archival tissue specimens and associated data, GRN maintains an approved BioBank Protocol with the Western Copernicus Group IRB (WCGIRB) titled The Guardian BioBank Initiative: Accelerating Clinical Research and Access to Clinical Trials.

4. GRN Data Quality

Generating RWE for regulatory decision‐making relies on the use of RWD that is relevant and reliable to generate evidence for a specific research question [3, 6, 7, 18]. Because RWD is sourced from varying clinical practices, EHR aggregation and harmonization techniques like those applied by GRN play a crucial role in transforming raw data into research‐ready RWD. Application of these techniques involves characterizing both raw data and source systems, standardized terminology mapping and harmonization of data across systems, and derivation of data‐phenotypes, or research‐ready measures. The quality checks described in this paper are implemented across this lifecycle to ensure appropriate use of RWD and reproducible evidence.

To demonstrate alignment of GRN RWD to published frameworks, the sub‐dimensions of data quality and associated definitions identified through the targeted review by Castellanos et al. have been adopted and applied (Table S1). Although numerous data quality frameworks and definitions exist, this paper uses the dimensions of data quality as defined in Castellanos et al. given we are performing a similar exercise of mapping data quality dimensions to corresponding quality checks and processes in GRN. Examples are provided in the text as well as in Table 2 to further illustrate the implementation of these data quality dimensions in practice.

TABLE 2.

Data reliability subdimensions in GRN.

Data type: structured (processing method: harmonization) or unstructured (processing methods: human abstraction, ML/NLP extraction), with reliability considerations and examples for each subdimension below.
Accuracy

GRN harmonizes EMR data to standardized ontologies and widely recognized coding systems such as International Classification of Diseases (ICD), RxNorm, Logical Observation Identifiers Names and Codes (LOINC), and Current Procedural Terminology (CPT). These standardized frameworks allow for accurate categorization and alignment of data elements across sources.

GRN regularly updates these code sets from source to ensure we have the latest and most accurate mappings. Manual review takes place for data elements that cannot be mapped by automated GRN processes. GRN's comprehensive data lake stores the source data in its original form to allow for periodic review of accuracy against the source data. GRN implements rigorous source validation processes to ensure that the harmonized data is accurate, reliable, and ready for comprehensive analysis, supporting high‐quality research outcomes. GRN's data quality and validation tools allow this source validation process to be timely and effective.

Chart abstraction begins with the development of a curation specification document that defines every data element for collection, including operational definitions of data elements. Curators are trained on the specification guidelines to ensure consistency of data abstraction across individual abstractors.

Abstracted data is entered into a proprietary data collection system, compliant with FDA 21 CFR Part 11. Within the data collection system, automated validation checks enable real‐time feedback to clinical curators if data entry does not meet the validation requirements. The data collection system allows for dual entry, where two different clinical abstractors input the same information. The system validates that both entries match, and in cases of discrepancies, the abstracted information is adjudicated by a third reviewer.

Once abstraction is complete, the data are validated, specifically evaluating adherence to project specifications and disease criteria.

Validation of the NLP pipelines is conducted using human subject matter experts as the reference standard, ensuring the model's outputs are compared against trusted information. Performance metrics such as accuracy, precision, sensitivity, specificity, and the F1 score are calculated to evaluate the model's effectiveness. NLP queries are re‐verified periodically to assess the NLP pipeline's accuracy and ensure it maintains the required performance level. This process ensures the model's reliability and consistency over time. Data may be considered inaccurate when verification checks (Table 3) fail without a reasonable explanation. For example, a lab result that is significantly outside of reference ranges may indicate an inaccurate result or may indicate an actual significant finding. Similarly, if a patient has a record of a medication prior to the date of approval of that medication, this may indicate inaccuracy or receipt of the therapy in an off‐label setting or as part of a clinical trial. Access to clinical notes to validate unusual findings from the data helps to confirm accuracy issues or provide important context for a true result.
Completeness

GRN collects complete data from current and legacy data sources, including EMR systems, PACS, interface engines, and media storage. This includes, but is not limited to, discrete data, scanned documents, and medical images (DICOM). Once gathered, GRN enhances the completeness of these data via interfaces to other sources, including a Master Patient Index, which links patients from across hospital systems to a single best record.

GRN links patients to external data sources such as the Social Security Death Index and the Area Deprivation Index. Automated data ingestion checks identify data gaps and allow GRN's harmonization processes to be routinely evaluated and improved.

Within the data collection system, logic checks are put in place to ensure that required fields are completed and values are within acceptable ranges. Data curation also enables greater visibility into the completeness of data by allowing curators to report a data element as “Not Available”, such as those that are missing or not found in the EMR. This represents truly missing data. Alternatively, curators may select “Not Available/Pending” if the data elements are not present in the EMR data at the time of data curation but are expected to become available in the future. After the initial completion of curation, the Quality Assurance team may issue queries into the data collection system to inquire about missing data elements. GRN collects a wide range of documents, reports, and notes, including but not limited to pathology reports, radiology reports, and progress notes, all of which are processed daily through GRN's NLP pipelines. These documents are gathered from hospital systems to ensure comprehensive data coverage. The NLP queries developed by GRN are designed with high sensitivity, allowing extraction of detailed information from unstructured text. The results of these queries are then classified into key categories: current condition, historical, family history, and hypothetical. By extracting critical information from unstructured data sources that are not available in discrete datasets, GRN NLP significantly enhances the completeness of patient records, providing a more complete picture of a patient's health journey. Data may be considered incomplete when there is a significant gap in EMR activity during a period of expected healthcare interactions. This may be due to a patient seeking care at an external health system, in which case records are available from clinical notes only. Curation of the clinical notes may improve the completeness of a patient record. Patients with significant gaps in data may be censored or excluded from an analysis.
Traceability

GRN adopts a standardized yet flexible approach to data aggregation to ensure that data is consistently processed and integrated, regardless of the specific methods or algorithms employed. If a sponsor intends for GRN to use a specific algorithm to create a dataset, GRN will collaborate with the sponsor to identify potential risks associated with the algorithm and develop tailored plans to mitigate these risks.

Source patient IDs and source record IDs are maintained throughout the record lifecycle, allowing both the patient and the individual components of the patient's medical record to be traced back to their source. When GRN identifies and merges duplicate patients from across multiple health systems, the original non‐harmonized data remains available for review as needed. Additionally, source indicators allow records to be traced back to the technology that produced the record, such as interface engine, database query, or file‐based extract. Record hashes allow GRN to track new versus changed records and allows review of changes to a patient record over time.

The data collection system maintains an audit trail, which includes information on the user(s) who enter data, deletions and alterations of curated data, when data entry occurred, and the device from which the entry occurred.

With GRN NLP, NLP results are carefully linked back to their original data sources for full traceability and validation. Each extracted result is accompanied by a snippet of the text from which it was derived, ensuring that users can validate the findings and refer back to the source document, report, or note if needed.

GRN NLP employs version control and industry‐standard change control mechanisms for any code or query releases into the production environment. This ensures that all changes are documented, reviewed, and controlled, providing a stable and auditable production system.

Data within GRN's data warehouse is all traceable back to the health system and EMR software in which the data was recorded. However, traceability may fail for patient‐reported data recorded within the EMR, such as patient‐reported medical history (e.g., surgeries from 20 years prior) or current medications prescribed by a provider outside of the GRN network. Records that lack traceability may be easily identified by failures of completeness as well, such as a medication record without a dose or a surgery reported by the year only.
Timeliness

Data is available within 24 h of entry into the source system. GRN leverages a combination of real‐time interfaces, self‐service data feeds via our Guardian Fusion application, and data that is pushed to GRN via SFTP or similar. GRN leverages cloud technology to horizontally scale pipelines and deliver accurate data in a concise time window. Custom dashboards and automated reporting systems allow GRN to maintain 24/7 monitoring and respond promptly to any issues or delays.

For curated data, data is likewise available within 24 h of entry into the source system. Historic data is available to provide a comprehensive patient picture to curators. If new data is available after the chart is curated, the data collection system flags new data availability and requires updated curation.

GRN NLP pipelines are designed to run daily, processing new data and ensuring that the results are promptly delivered to GRN products each morning at the start of the business day. This ensures that the latest insights are available for decision‐making and analysis as soon as possible. To maintain the reliability and efficiency of this process, custom dashboards and automated reporting systems are in place to provide continuous 24/7 monitoring of the pipeline's performance. These systems allow for real‐time tracking and immediate detection of any issues or delays. Should any disruptions occur, alerts are triggered, enabling teams to respond and resolve problems before they impact operations. This combination of automated daily processing, around‐the‐clock monitoring, and rapid issue resolution ensures the seamless integration of NLP results with GRN products, contributing to the overall stability and effectiveness of the system.

24/7 monitoring of data pipelines helps to mitigate concerns over timeliness, as these issues are identified and resolved quickly.
Data coming from non‐EMR systems, such as the Social Security Death Index, has documented timeliness issues (a 4–6 month lag in reporting), which may lead to underreporting of death data if not captured by the EMR alone [20].
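The validation metrics named in the Accuracy row above (accuracy, precision, sensitivity, specificity, F1) can be sketched as a comparison of model outputs against a human subject‐matter‐expert reference standard. The labels below are fabricated for illustration, not GRN validation data.

```python
# Minimal sketch of binary NLP validation metrics against a human-labeled
# reference standard. Labels are illustrative only.

def validation_metrics(predicted, reference):
    """Compute standard classification metrics from paired boolean labels."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    tn = sum(not p and not r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(not p and r for p, r in zip(predicted, reference))
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / len(reference)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Example: hypothetical NLP-extracted biomarker positivity vs. expert truth.
nlp = [True, True, False, True, False, False]
ref = [True, False, False, True, False, True]
print(validation_metrics(nlp, ref))
```

Periodic re‐verification, as described in the table, amounts to re‐running such a comparison on fresh expert‐labeled samples and confirming the metrics still meet the required performance level.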

4.1. Reliability

Reliability is defined as the degree to which data represent the clinical concept intended and is assessed using four sub‐dimensions: accuracy, completeness, traceability, and timeliness [1]. As part of feasibility, the reliability of the data independent of a specific study design or use case should be assessed, leveraging not only publicly available information and documentation but also thorough correspondence with the data holder. Although critical for initial feasibility, this assessment does not guarantee that the data are sufficiently ‘reliable’ until final data abstraction or analyses are complete [21]. To that end, data accrual (how the data is handled for a study‐specific purpose) and its documentation are an important aspect of demonstrating reliability to regulators at the time of submission and audit. The following sections and corresponding tables (Tables 2 and 3) provide details on information that would be useful to share with regulators when describing the reliability of a selected RWD source.

TABLE 3.

Sample of verification checks in GRN.

Category Subcategory Description Example verification check Example resolution (if verification check failure)
Conformance Value conformance Data values conform to internal formatting constraints Numeric fields should not include operators (< or >) Ascertain the reason the operator was utilized. If it adds clinical value to the dataset, consider updating the data model.
Data values conform to allowable values or ranges Race and ethnicity are reported in accordance with US Census Standardization Race and ethnicity results that fall outside of US Census categories are classified as “Other”
Relational conformance Data values conform to relational constraints Breast cancer patients treated with endocrine therapy have a documented hormone receptor–positive diagnosis. Chart is reviewed for clinical justification for receipt of endocrine therapy without a hormone receptor–positive diagnosis.
Unique (key) data values are not duplicated. Provider ID is associated with one provider Provider record evaluated for logical justification, such as change in name or change in specialty.
Changes to the data model or data model versioning. Dataset adheres to the latest version of the data model Change log is consulted and dataset is updated to adhere to data model.
Computational conformance Computed values conform to programming specifications Human‐abstracted adverse events (AEs) are assigned a grade that aligns with CTCAE criteria for the specific AE when clinical documentation (such as lab results) allows. Curated record is reviewed and updated as necessary.
Plausibility Uniqueness plausibility Data values are not duplicated that should remain unique A single patient's record is reported only once in a dataset. If dataset details demonstrate two or more patients with identical or nearly identical data, records are reviewed to confirm uniqueness.
Atemporal plausibility Data values and distributions agree with internal measurement or local knowledge Distribution of patients by stage at diagnosis mirror national cancer registry data Records are reviewed to confirm accuracy of stage reporting. Curated record updated as necessary.
Data values and distributions for independent measurements of the same fact are in agreement. Patients reported to have BMI over 30 are also reported as having obesity as a comorbidity. Record is reviewed to confirm obesity diagnosis, including any clinical justification that a provider has not diagnosed the patient as obese.
Logical constraints between values agree with local or common knowledge (includes “expected” missingness) Patients reported to have a pregnancy are within expected child‐bearing years. Record is reviewed for justification and updated as required.
Values of repeated measurement of the same fact show expected variability. Body Surface Area (BSA) measurements taken within the same facility over the course of a month show a maximum change of 0.1 m [2]. Records are reviewed to ensure that repeated BSA measurements over a one‐month period are consistent, with any change exceeding 0.1 m [2] flagged for follow‐up. This review includes checking for potential calculation errors, patient weight/height errors.
Temporal plausibility Values conform to expected logical temporality Date of treatment is later than Date of diagnosis Record is reviewed for duplicate diagnoses or justification for treatment prior to diagnosis. Curation record updated if no justification supports temporality failure.
Clinical events follow logical temporality Screening events occur prior to diagnostic events Record is reviewed for justification of temporality failure. Curation record updated if no justification identified.
Clinical interventions occur at a rate consistent with commercial availability. Patients are not reported to have treatment with a therapy prior to the therapy's regulatory approval. Record is reviewed for justification of temporality failure, such as use in a clinical trial. Curation record updated if no justification identified.
Consistency Cross field consistency Data are consistent across multiple data fields A patient reported to have an incident diagnosis of diabetes has A1C test results consistent with diabetes Record is reviewed to confirm diagnosis is incident rather than a prevalent diabetes diagnosis that is being effectively controlled.
Temporal consistency Addition of new records occurs at a consistent rate Monthly disease‐specific counts are evaluated to identify volume of patients in the network with specific conditions Data harmonization team is consulted to confirm whether new data has become available or to understand
Agreement Structured and Curated data result in the same values Curation result for TNM stage matches NLP‐derived cancer stage Secondary reviewer checks NLP and curated data and confirms correct response. Tuning to NLP query or re‐training of curation staff implemented.
Reproducibility Repeat use of the same record yields same or similar results Patients identified with a chronic condition in an initial dataset should continue to have the chronic condition in refreshed datasets. EMR record is evaluated for update within the audit log to ascertain when the record change occurred. Health systems frequently update ICD‐10 codes within EMR records to improve accuracy of billing and reimbursement.
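
Verification checks like those in the table above can be expressed as simple rules evaluated over curated records. The sketch below is purely illustrative and is not GRN's implementation: the function names, field names (`treatment_date`, `lab_value`), and record layout are all hypothetical, and a production system would route flagged records into the chart-review workflows the table describes.

```python
from datetime import date

def check_temporal_plausibility(record):
    """Flag records where treatment precedes diagnosis (temporal plausibility)."""
    issues = []
    if record["treatment_date"] < record["diagnosis_date"]:
        issues.append("treatment precedes diagnosis")
    return issues

def check_value_conformance(record, numeric_fields=("lab_value",)):
    """Flag numeric fields that contain comparison operators such as '<' or '>'."""
    issues = []
    for field in numeric_fields:
        v = record.get(field)
        if isinstance(v, str) and any(op in v for op in ("<", ">")):
            issues.append(f"{field} contains an operator: {v!r}")
    return issues

record = {
    "diagnosis_date": date(2023, 5, 1),
    "treatment_date": date(2023, 4, 20),  # precedes diagnosis -> flagged
    "lab_value": "<0.5",                  # operator in a numeric field -> flagged
}
flags = check_temporal_plausibility(record) + check_value_conformance(record)
print(flags)  # two flags, each routed to manual chart review
```

In practice, each flag would carry enough context (patient identifier, field, rule violated) for a curator to review the chart and either correct the record or document a clinical justification.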

4.2. Accuracy

Accuracy can be defined as the closeness of agreement between the measured value and the true value of what is intended to be measured [1]. Accuracy is addressed using validation approaches, such as comparison with external or internal reference standards or indirect benchmarking, and verification checks for conformance, consistency, and plausibility, as defined by Castellanos et al. (Tables S1 and 3) [1]. To promote accuracy, each study utilizing GRN data is guided by custom curation protocols that outline specific instructions and rules for curating each data element. GRN applies quality controls, including a series of verification checks, to assess any issues related to the data. GRN implements validation processes to verify the integrity and accuracy of the aggregated datasets, ensuring they meet required standards for research and analysis and facilitating future integration with other datasets, both domestically and internationally (Table 2) [16]. Of note, GRN's standard data can be tokenized to allow for the combination of EHR and claims data, and deduplication of records across multiple data sources as a means of bolstering data accuracy [16].

As in any EMR system, accuracy issues due to miscoding or non‐specific coding may result from reliance on ICD‐10 codes alone [22]. To address this well‐known limitation, GRN's data curation processes allow for the validation of certain diagnoses. For example, a patient may have a BMI measurement that indicates obesity but no ICD‐10 code indicating an obesity diagnosis. In such cases, additional data elements may be cross‐referenced to ascertain whether the patient is truly obese, such as the presence of more than one BMI record in the obese range or other body composition data elements, such as waist circumference and muscle mass. This more comprehensive record gives researchers utilizing the database greater confidence that the cohort of patients being studied is accurately diagnosed [23].
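
The cross-referencing described above amounts to requiring corroborating evidence before accepting a phenotype. The sketch below is an illustrative rule only, not GRN's curation logic: the field names, the ICD-10 prefix match, and the threshold of two obese-range BMI records are all assumptions for the example.

```python
def corroborated_obesity(patient):
    """Return True if obesity is supported by a diagnosis code OR repeated BMI evidence.

    Illustrative rule: two or more BMI measurements in the obese range (>= 30)
    corroborate obesity even without an ICD-10 E66.* code. Field names and the
    threshold are hypothetical, not GRN's actual specification.
    """
    has_dx = any(code.startswith("E66") for code in patient["icd10_codes"])
    obese_bmis = [b for b in patient["bmi_history"] if b >= 30]
    return has_dx or len(obese_bmis) >= 2

# No obesity code, but two BMI records in the obese range -> corroborated
patient = {"icd10_codes": ["I10"], "bmi_history": [31.2, 30.5, 29.8]}
print(corroborated_obesity(patient))  # True
```

A single elevated BMI would not qualify under this rule, which mirrors the paper's point that repeated measurements provide more confidence than any one data point.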

4.3. Completeness

Completeness is the presence of data values, meaning data value frequencies, without reference to actual values themselves [1]. Within the data collection system, logic checks are put in place to ensure that required fields are completed and values are within an acceptable range (Table 2). The threshold for what is considered acceptable completeness for a given research question will depend on how a variable is being used in the analysis. A useful rule of thumb is to set thresholds for missing data, which may vary based on the prognostic importance of particular variables and how those variables are being used (e.g., eligibility criteria, propensity score matching). For example, variables for matching with missingness under 5%–10% can likely be included in most situations, whereas those with ≥ 30% missingness should likely be excluded from any matching analyses, and the utility of the data source may need to be reconsidered [24]. Missingness in the outcome variable (e.g., missing assessment of response) may be less well‐tolerated as it cannot be properly estimated or adjusted for [21].
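
The rule of thumb above can be made concrete as a triage over per-variable missingness. This is a minimal sketch under the thresholds quoted in the text (include below 5%–10%, exclude at ≥ 30%); the function name, default cutoffs, and data layout are illustrative, and real thresholds would be set a priori per variable and per use.

```python
def missingness_triage(columns, include_below=0.10, exclude_at=0.30):
    """Classify candidate matching variables by fraction missing.

    columns: dict mapping variable name -> list of values (None = missing).
    Thresholds are illustrative defaults taken from the rule of thumb in the
    text; anything between them is marked for case-by-case review.
    """
    triage = {}
    for name, values in columns.items():
        frac_missing = sum(v is None for v in values) / len(values)
        if frac_missing < include_below:
            triage[name] = "include"
        elif frac_missing >= exclude_at:
            triage[name] = "exclude"
        else:
            triage[name] = "review"
    return triage

cohort = {
    "age":   [64, 71, 58, 69, 75, 62, 66, 70, 59, 68],                       # 0% missing
    "stage": [None, "II", "III", None, None, "IV", "I", None, "II", "III"],  # 40% missing
}
print(missingness_triage(cohort))  # {'age': 'include', 'stage': 'exclude'}
```

As the text notes, this triage applies to covariates used for matching; missingness in an outcome variable is harder to tolerate because it cannot be properly estimated or adjusted for.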

Importantly, in secondary data, completeness is dependent on what was entered into the EMR by the treating provider. GRN provides a unique opportunity to improve the completeness of certain variables under the aforementioned BioBank Protocol. Within the oncology setting, emerging biomarkers continue to grow in clinical importance; ascertaining real‐world prevalence may be difficult when evaluating historic health records, as they are not routinely assessed. In a study assessing human epidermal growth factor receptor 3 (HER3) expression in non‐small cell lung cancer, GRN provided archival tissue specimens for retrospective testing [25]. Additionally, GRN utilizes NLP to parse for staging or histology, two variables that are often missing from EMR data (see Table 2 for information on NLP validation).

4.4. Provenance (Traceability)

Traceability refers to an audit trail that accounts for the origin of a piece of data (in a database, document, or repository) together with an explanation of how and why it got to the present place [1]. Castellanos et al. uses the term provenance, which is consistent with the language used in FDA draft guidance; however, the FDA guidance document “Real‐World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision‐Making for Drug and Biological Products” finalized in July 2024 uses the term traceability instead of provenance, which we adopt here [1, 7]. When leveraging RWD for a regulatory submission, documentation of data curation and transformation processes should be in place. This may include electronic documentation (e.g., audit trails, quality control procedures, etc.) of data additions, deletions, or alterations from the source data system to the final study analytic data set(s). The GRN data collection system maintains an audit trail of user activity, including data entry, deletions, and alterations of curated data. Data transformations, such as anonymizing dates to maintain patient privacy, are documented and auditable [16].
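
The audit-trail concept can be sketched as an append-only log of every addition, deletion, or alteration, from which the lineage of any curated value can be replayed. This is a hypothetical schema for illustration only; it does not represent GRN's data collection system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEntry:
    """One immutable row in an append-only audit trail (illustrative schema)."""
    user: str
    action: str          # "create" | "update" | "delete"
    field_name: str
    old_value: object
    new_value: object
    timestamp: str

class AuditLog:
    def __init__(self):
        self._entries = []

    def record(self, user, action, field_name, old_value, new_value):
        """Append an entry; existing entries are never modified or removed."""
        self._entries.append(AuditEntry(
            user, action, field_name, old_value, new_value,
            datetime.now(timezone.utc).isoformat(),
        ))

    def history(self, field_name):
        """Replay every change to one field -- its audit trail."""
        return [e for e in self._entries if e.field_name == field_name]

log = AuditLog()
log.record("curator_01", "create", "tnm_stage", None, "T2N0M0")
log.record("curator_02", "update", "tnm_stage", "T2N0M0", "T2N1M0")
print(len(log.history("tnm_stage")))  # 2: the full lineage of the curated value
```

The key property for regulatory traceability is that the log is append-only and attributable: who changed what, when, and from which prior value, all the way from source data to the final analytic dataset.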

4.5. Timeliness

Timeliness means that data are collected and curated with acceptable recency such that the data set represents reality during the period of coverage [1]. In GRN, data are available within 24 h of entry into the source system, supporting analysis and decision‐making as soon as possible. Historic data are available to provide curators with a comprehensive patient view. If new data become available after a chart is curated, the data collection system flags the new data and requires updated curation (Table 2). In the case of inspections, audits, or questions from regulators, it is important to be able to provide additional data or analyses in a timely manner if requested.
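
The re-curation trigger described above reduces to a timestamp comparison: a chart needs attention whenever its source data changed after the last curation pass. A minimal sketch, with hypothetical field names that do not represent GRN's system:

```python
from datetime import datetime

def needs_recuration(chart):
    """Flag a chart whose source EHR data changed after the last curation pass.

    Field names are illustrative; this mirrors the re-curation trigger
    described in the text, not an actual GRN interface.
    """
    return chart["source_last_updated"] > chart["last_curated"]

chart = {
    "last_curated":        datetime(2025, 3, 1, 9, 0),
    "source_last_updated": datetime(2025, 3, 4, 14, 30),  # new EHR data arrived
}
print(needs_recuration(chart))  # True -> chart is re-queued for curation
```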

5. Relevance (Availability, Sufficiency, Representativeness)

While the reliability of GRN RWD has been discussed in detail, there are also approaches to increasing the relevance of GRN data for a specific research question by bolstering the availability of critical variables as well as the sufficiency and representativeness of the population [1]. The ability to customize GRN data creates flexibility to improve data relevance, for example by enriching secondary data with primary data collection (e.g., including patient‐reported outcomes (PROs)). As regulatory expectations for data relevance increase (e.g., use of RWD for external comparator studies), demonstrating the interpretability and validation of endpoints and the exchangeability of real‐world study populations may require primary data collection. This approach may be necessary when missingness of key study variables does not meet predetermined thresholds (see Completeness), or when covariates of interest to regulators are not routinely captured in clinical practice.

Data sets (and associated data dictionaries) may be designed to address specific use cases. FDA recommends describing prior use of the selected data source for research purposes (e.g., previous submissions to FDA by the sponsor or relevant examples in the published literature) in the protocol, including a description of how well the selected data source has been shown to capture study variables and how the study variables can be validated for a particular research activity [7]. GRN RWD has been leveraged for various study designs including external comparator studies, treatment pattern analyses, and comparative effectiveness studies [26, 27, 28, 29]. These studies may support a variety of regulatory use cases, including label expansions and post‐marketing commitments.

6. Data Source Feasibility Considerations

When selecting RWD sources for a specific research purpose, sponsors should begin with a robust feasibility assessment. In general, this involves (1) defining the research question and regulatory strategy, (2) identifying potential data sources, (3) selecting based on a minimum set of a priori defined criteria, (4) justifying the source, and (5) engaging with regulators as needed on the proposed RWE approach and/or evidence strategy. Importantly, data quality is risk proportionate: while we should aim for the highest quality possible, the threshold for acceptability should vary by use case (e.g., publication vs. regulatory submission). A data source fit for one purpose may not meet the threshold for another, and prior use in a regulatory submission does not guarantee future acceptance. These limitations extend to the present analysis: because it uses the Castellanos et al. framework of data quality, its applicability to feasibility assessments leveraging other data quality frameworks or definitions may be limited. Further, data source feasibility assessments should acknowledge that technical transformations applied to RWD sources between source documentation and analytic use may impact appropriate use. Even high‐quality sources (e.g., with complete capture of key study variables) may be insufficient if the sample size is too small or unrepresentative of the target population. In such cases, supplementary evidence or repositioning the RWE as supportive may be necessary. We also recommend that sponsors request a sample of source documents (especially from unstructured data) during feasibility to verify whether the research question and/or approach needs to be modified based on what is available in the source data. While not all RWD providers offer access to deidentified source documentation, this is an important consideration for data source selection, as it facilitates investigation into data anomalies that may be identified during the course of research.

7. Conclusion

Structured approaches to identifying fit‐for‐purpose data underscore the need for comprehensive information about candidate data sources at the feasibility stage to inform decision making, study design, and transparent conversations with regulators. While varying definitions of data quality dimensions crowd the space for unifying on a gold‐standard definition of high‐quality RWD, traceability of data lineage is foundational to enabling accurate quality assessments of nearly all other dimensions of data quality.

When using secondary data, RWD extraction methods and data collection tools should incorporate both access controls and traceability features, including audit trails. Where such controls are lacking, the risks to data quality and integrity must be carefully assessed to determine regulatory suitability [30]. Researchers should define quality expectations a priori and assess the quality of the data received from a data provider prior to initiating analyses. This process helps identify data quality issues that may be addressed by the data provider or, in some cases, signals that the data are not suitable despite initial feasibility assessments.

This manuscript demonstrates how guidance on RWD quality can be operationalized in practice by describing GRN, a real‐world EHR database that has been used for regulatory decision making in oncology but also across a range of therapeutic areas for which there may be various evidence needs. Continued transparency from RWD providers, aggregators, and users in sharing their approaches to ensuring and evaluating data quality will support broader understanding and alignment. Further guidance is needed on best practices for responding to regulatory requests and ensuring audit readiness in the context of RWD use.

7.1. Plain Language Summary

Real‐world data (RWD) refers to health information collected outside of traditional clinical trials, such as electronic health records (EHRs). When researchers use this type of data to generate real‐world evidence (RWE) for decisions about drug safety or effectiveness, it is important that the data are both reliable and relevant to the research question. Regulatory agencies like the FDA increasingly expect researchers to explain how they have evaluated the quality of RWD sources used in their studies.

This paper describes how the Guardian Research Network (GRN), a health data network that brings together EHR data from regional cancer centers, ensures and documents the quality of its data. We use a framework and definitions previously applied to another EHR data source, which categorize data quality into two main criteria: reliability (how accurate, complete, timely, and traceable the data are) and relevance (whether the data are available, sufficient, and representative for a specific use).

We show how GRN meets these criteria using real examples of quality checks and documentation practices. For example, we describe how GRN ensures that patient records are accurate, how quickly new data become available, and how the patient population compares to the general population.

By clearly reporting these data quality measures, we aim to support the growing need for transparency in RWE studies, especially those submitted for regulatory purposes. This helps researchers and regulators make more informed decisions about whether a given data source is appropriate for specific research use.

Ethics Statement

All authors meet the authorship criteria as defined by PDS. Each author made a significant contribution to the work reported, participated in drafting or critically revising the article, agreed on the journal submission, reviewed and approved all versions of the manuscript, and accepts responsibility for the integrity of the work.

Conflicts of Interest

Andrea McCracken and Charlie Hurmiz are employees of Guardian Research Network (GRN), the organization whose data quality attributes are described in this manuscript. GRN aggregates and licenses real‐world data for research and regulatory purposes. Andrea and Charlie declare that this study was conducted objectively and without external influence on the findings.

Supporting information

Data S1: pds70202‐sup‐0001‐Supinfo.

PDS-34-e70202-s001.docx (29.7KB, docx)

McCracken A., Heidt J., Eldridge E., et al., “The Guardian Research Network: A Real‐World Data Source for Pharmacoepidemiologic Research and Regulatory Applications,” Pharmacoepidemiology and Drug Safety 34, no. 9 (2025): e70202, 10.1002/pds.70202.

Funding: The authors received no specific funding for this work.

References

  • 1. Castellanos E. H., Wittmershaus B. K., and Chandwani S., "Raising the Bar for Real‐World Data in Oncology: Approaches to Quality Across Multiple Dimensions," JCO Clinical Cancer Informatics 8 (2024): e2300046, 10.1200/CCI.23.00046.
  • 2. Lerro C. C., Bradley M. C., Forshee R. A., and Rivera D. R., "The Bar Is High: Evaluating Fit‐For‐Use Oncology Real‐World Data for Regulatory Decision Making," JCO Clinical Cancer Informatics 8 (2024): e2300261, 10.1200/CCI.23.00261.
  • 3. European Medicines Agency, "Reflection Paper on Use of Real‐World Data in Non‐Interventional Studies to Generate Real‐World Evidence" (2024).
  • 4. European Medicines Agency, "Data Quality Framework for EU Medicines Regulation: Application to Real‐World Data" (2024).
  • 5. Food and Drug Administration, "Use of Real‐World Evidence to Support Regulatory Decision‐Making for Medical Devices" (2017).
  • 6. Food and Drug Administration, "Considerations for the Design and Conduct of Externally Controlled Trials for Drug and Biological Products" (2023).
  • 7. Food and Drug Administration, "Real‐World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision‐Making for Drug and Biological Products" (2024).
  • 8. Food and Drug Administration, "Real‐World Evidence: Considerations Regarding Non‐Interventional Studies for Drug and Biological Products" (2024).
  • 9. Riskin D. J., Monda K. L., Gagne J. J., et al., "Implementing Accuracy, Completeness, and Traceability for Data Reliability," JAMA Network Open 8, no. 3 (2025): e250128, 10.1001/jamanetworkopen.2025.0128.
  • 10. Nafie M., Parker V. J., McClellan M., and Hendricks‐Sturrup R. M., "A Brief Report on Proposed Areas of International Harmonization of Real‐World Evidence Relevance, Reliability and Quality Standards Among Medical Product Regulators," Pharmacoepidemiology and Drug Safety 34, no. 3 (2025): e70127, 10.1002/pds.70127.
  • 11. Bian J., Lyu T., Loiacono A., et al., "Assessing the Practice of Data Quality Evaluation in a National Clinical Data Research Network Through a Systematic Scoping Review in the Era of Real‐World Data," Journal of the American Medical Informatics Association 27, no. 12 (2020): 1999–2010, 10.1093/jamia/ocaa245.
  • 12. IQVIA, "Real‐World Evidence (RWE) Benchmarking Quality and Compliance Standards" (2022).
  • 13. Food and Drug Administration, "Considerations for the Use of Real‐World Data and Real‐World Evidence to Support Regulatory Decision‐Making for Drug and Biological Products" (2023).
  • 14. University of Wisconsin School of Medicine and Public Health, "Area Deprivation Index 2024."
  • 15. Kind A. J. H. and Buckingham W. R., "Making Neighborhood‐Disadvantage Metrics Accessible–The Neighborhood Atlas," New England Journal of Medicine 378 (2018): 2456–2458, 10.1056/NEJMp1802313.
  • 16. Guardian Research Network, "Guardian Research Network's Implementation of FDA Guidance: Data Standards for Drug and Biological Product Submissions Containing Real‐World Data Guidance for Industry" (2024), https://www.guardianresearch.org/blog_posts/guardian‐research‐networks‐implementation‐of‐fda‐guidance.
  • 17. Food and Drug Administration, "Data Standards for Drug and Biological Product Submissions Containing Real‐World Data" (2023).
  • 18. Food and Drug Administration, "Real‐World Data: Assessing Registries to Support Regulatory Decision‐Making for Drug and Biological Products" (2023).
  • 19. IQVIA, "Discover IQVIA's Oncology Collaboration With Guardian Research Network (GRN)" (2021).
  • 20. Levin M. A., Lin H. M., Prabhakar G., McCormick P. J., and Egorova N. N., "Alive or Dead: Validity of the Social Security Administration Death Master File After 2011," Health Services Research 54, no. 1 (2019): 24–33, 10.1111/1475-6773.13069.
  • 21. Curtis L. H., Sola‐Morales O., Heidt J., et al., "Regulatory and HTA Considerations for Development of Real‐World Data Derived External Controls," Clinical Pharmacology and Therapeutics 114, no. 2 (2023): 303–315, 10.1002/cpt.2913.
  • 22. Horsky J., Drucker E. A., and Ramelson H. Z., "Accuracy and Completeness of Clinical Coding Using ICD‐10 for Ambulatory Visits," American Medical Informatics Association Annual Symposium Proceedings 16 (2017): 912–920.
  • 23. Richter J., Davids M. S., Anderson‐Smits C., et al., "Burden of Infection in Patients With and Without Secondary Immunodeficiency Disease Following Diagnosis of a Mature B Cell Malignancy," Clinical Lymphoma Myeloma and Leukemia 24, no. 8 (2024): 553–563, 10.1016/j.clml.2024.04.002.
  • 24. Dong Y. and Peng C. Y. J., "Principled Missing Data Methods for Researchers," SpringerPlus 2 (2013): 222.
  • 25. Soo R. A., Clinthorne G., Santhanagopal A., et al., "HER3 Is Widely Expressed Across Diverse Subtypes of NSCLC in a Retrospective Analysis of Archived Tissue Samples," Future Oncology 20, no. 37 (2024): 2961–2970, 10.1080/14796694.2024.2398983.
  • 26. Watson M. L., "Assessing the Relationship in Relapsed‐Refractory Multiple Myeloma Between Response, Progression, and Survival Between Pooled Clinical Trial Subjects and a Real‐World Electronic Medical Record Data Source," Journal of Clinical Oncology 38 (2020): e20525.
  • 27. Van Le H., Naarden Braun K., Nowakowski G. S., et al., "Use of a Real‐World Synthetic Control Arm for Direct Comparison of Lisocabtagene Maraleucel and Conventional Therapy in Relapsed/Refractory Large B‐Cell Lymphoma," Leukemia and Lymphoma 64, no. 3 (2023): 573–585, 10.1080/10428194.2022.2160200.
  • 28. Wing V. K., "HSR22‐179: Treatment Patterns of Patients With Soft Tissue Sarcoma Progressing to Third Line of Therapy," Journal of the National Comprehensive Cancer Network 20, no. 3.5 (2022): HSR22‐179.
  • 29. Shah N., Sussman M., Crivera C., Valluri S., Benner J., and Jagannath S., "Comparative Effectiveness Research for CAR‐T Therapies in Multiple Myeloma: Appropriate Comparisons Require Careful Considerations of Data Sources and Patient Populations," Clinical Drug Investigation 41, no. 3 (2021): 201–210, 10.1007/s40261-021-01012-x.
  • 30. Grandinetti C., Rivera D. R., Pai‐Scherf L., et al., "Keeping the End in Mind: Reviewing U.S. FDA Inspections of Submissions Including Real‐World Data," Therapeutic Innovation & Regulatory Science (2025), 10.1007/s43441-025-00791-1.
