Health Science Reports. 2025 Sep 22;8(9):e71272. doi: 10.1002/hsr2.71272

Combining Real‐World and Clinical Trial Data Through Privacy‐Preserving Record Linkage: Opportunities and Challenges—A Narrative Review

Michael Batech 1, Ann Madsen 1, Nicolle Gatto 1, Tancy C Zhang 1, Deborah Ricci 2, Raymond Harvey 2, Najat Khan 2, Sid Jain 2
PMCID: PMC12453958  PMID: 40994776

ABSTRACT

Background and Aims

Despite their widespread use, randomized clinical trials (RCTs) face challenges like differential loss to follow‐up, which can impact validity. Real‐world evidence (RWE) from real‐world data (RWD) is increasingly used to address these limitations, but RCTs and RWE have provided complementary, disconnected observations of the patient journey. Privacy‐preserving record linkage (PPRL) enables the integration of patient records across these data sources. This narrative review explores the potential use cases of PPRL to overcome the limitations of both RCTs and RWD for clinical research and regulatory decision‐making.

Methods

This manuscript is a narrative review and did not involve the collection or analysis of primary research data. The authors aimed for comprehensive topic coverage and a synthesis of key concepts from the current literature, rather than adhering to a formal systematic review protocol (e.g., PRISMA).

Results

PPRL can generate a more comprehensive understanding of patient interaction with the healthcare system. For example, long‐term information about participants before and after a trial can assist in identifying predictors of drug response or intolerance, reducing patient burden, and providing alternatives to traditional study designs. Linked data applications include expanding patient health histories and creating comprehensive patient data repositories that enable innovative trial designs. However, opportunities remain to demonstrate the provenance, quality, and completeness of RWD sources to ensure scientific rigor.

Conclusion

Combining RCTs and RWD through PPRL offers significant and insufficiently explored potential for advancing drug development research, reducing operational costs, and enhancing data availability. Further consideration of PPRL use cases may drive innovative trial designs augmented with RWD, improving the ability of this collected data to support informed decision‐making.

Keywords: big data, drug development, linked data, observational study, patient data privacy, secondary data analysis

Summary

  • Combining randomized clinical trials (RCTs) and real‐world data (RWD) through privacy‐preserving record linkage (PPRL) offers significant and insufficiently explored potential for advancing drug development research, reducing the patient burden and operational costs, and enhancing data availability and quality.

  • Potential applications of linked data include expanding patient health history, comparing data validity, and creating comprehensive patient data repositories that enable innovative trial designs and informed decision‐making during both the pre‐ and post‐marketing phases of drug development. Further use cases have yet to be developed or explored, and further research and lessons learned from using PPRL are still needed.

1. Overcoming the Limitations of Randomized Clinical Trials and Real‐World Data by Working Together

Real‐world data (RWD), derived from individual encounters with healthcare systems, devices, and surveys (including patient‐reported outcomes; PROs), are widely used in medical research. They have become an increasingly trusted source of evidence among health researchers, pharmaceutical and medical device manufacturers, and other healthcare‐related companies since the US Food and Drug Administration (FDA) issued guidance [1, 2] on the use of RWD in regulatory decision‐making in response to the United States (US) 21st Century Cures Act of 2016 [3, 4]. Research on the risks and benefits of medical products that is based on the secondary use of data collected during routine clinical practice or for healthcare administration is referred to as real‐world evidence (RWE) [5] and is intended to assess effectiveness. Clinical trials, in contrast, collect data prospectively in controlled settings, where study participants are assigned to an intervention to rigorously assess the efficacy and safety of medical interventions. In randomized controlled trials (RCTs), patients are randomized to treatment groups to minimize bias and confounding [6].

In 2018, the FDA created a framework for evaluating the potential for RWE to support label expansion of approved drugs or biological products and to satisfy post‐approval study requirements [7]. These and other guidance documents have largely discussed the role of RWD as complementary to, or as a substitute for, RCTs [7, 8, 9, 10]. In the following sections, we summarize some known limitations of RCTs and RWD and, in this context, consider the potential gain of information from combining individual‐level RCT data with longitudinal RWD. We also provide an overview of privacy‐preserving record linkage (PPRL) of patient‐level data and related considerations. This manuscript is a narrative review of the current literature on combining real‐world and clinical trial data through PPRL; the authors aimed for broad topic coverage and a synthesis of key concepts rather than adhering to a formal systematic review protocol (e.g., PRISMA).

2. Clinical Trials, Their Utility, and Their Limitations

RCTs are the gold standard for evaluating the risks and benefits of interventions aimed at improving human health (e.g., medicines, vaccines, and medical devices). In sufficiently large samples, the random assignment of participants balances sources of random and systematic error, for example, potential measured or unmeasured confounders, across study arms such that any observed differences in study outcome rates ought to reflect the average effect of the intervention in the study population [11]. Randomization, combined with primary data collection, outcome adjudication, and blinding, maximizes the internal validity of safety and efficacy comparisons between medical products in vivo. RCTs are therefore the gold standard for internal validity (the degree of confidence that an observed causal relationship is not influenced by confounding) and for informing regulatory decision‐making [12]. Nonetheless, important limitations include operational challenges and high costs that may render RCTs infeasible or reduce their internal or external validity.

With respect to feasibility, trials may struggle to enroll participants due to rarity of the condition, inability or unwillingness to follow the protocol, or lack of access to study sites [13, 14, 15]. This may prevent a trial from being conducted or require an ongoing study to end prematurely due to futility. Studies requiring long‐term follow‐up are particularly costly and require substantial resources to ensure continued patient follow‐up despite patient relocation or changes in health plan coverage. Additionally, sample size requirements may increase to account for anticipated loss to follow‐up from competing risks (e.g., death or other exposures related to the outcome or treatment). Sample size estimates assume an average treatment effect, but a treatment may have different risks or benefits in certain patient subgroups (e.g., super‐responders), and capturing all potential characteristics that define these subgroups may not be feasible or foreseeable in the trial design, especially given the trade‐off between data quantity and data quality.
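The effect of anticipated loss to follow‐up on enrollment targets can be illustrated with a small calculation. The sketch below uses a common rule of thumb, dividing the required number of evaluable participants by the expected retention fraction; this rule, the function name `inflate_for_dropout`, and the example numbers are assumptions made for illustration and are not taken from the studies cited here.

```python
import math


def inflate_for_dropout(n_required: int, dropout_rate: float) -> int:
    """Enrollment target that leaves roughly n_required participants after dropout.

    Rule of thumb (an assumption, not a method from the review):
    n_enrolled = n_required / (1 - dropout_rate), rounded up.
    """
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n_required / (1.0 - dropout_rate))


# Hypothetical numbers: 300 evaluable participants needed, 15% expected dropout.
print(inflate_for_dropout(300, 0.15))  # 353
```

This simple inflation assumes that dropout is unrelated to outcome; it does not correct the bias introduced by differential dropout discussed below.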

Factors that impact internal validity include differential patient drop‐out and non‐adherence, for example, excess drop‐out or non‐adherence in the treatment arm due to adverse events or in the control arm due to lack of perceived benefit [16, 17, 18, 19]. These conditions result in informative censoring and potentially biased results.

External validity, or generalizability, of RCTs is limited by the fact that participants are self‐selected and willing to consent and adhere to study protocols, as well as by strict inclusion/exclusion criteria intended to maximize the power of a study [19, 20, 21]. Additionally, the protocolized outcome measurement of RCTs limits the generalizability of findings to real‐world clinical practice, where advanced, often expensive, assessments may not be used. Similarly, some benefits and risks cannot be measured in the controlled settings of RCTs, for example, when the benefit is improved compliance or adherence [22, 23] or maintenance of effect, or when the benefit extends beyond the individual patient level [16, 17, 18, 19].

These challenges may delay access to beneficial interventions for patients with unmet needs or discourage development programs due to excessively high operational costs. To overcome these limitations, trial designs aim to minimize duration of follow‐up for efficacy endpoints and rely on post‐approval studies for long‐term safety monitoring [22]. Similarly, conditional approvals based on changes in biomarkers or other intermediate outcomes require confirmation in long‐term studies of the primary clinical endpoint [24].

3. RWD, Their Utility and Their Limitations

RWD increasingly inform pipeline and clinical development programs, including the target product profile, natural history of proposed indications, optimal clinical trial eligibility criteria, site selection, endpoint definitions, and sample size estimation [25]. RWD studies frequently inform regulatory agency and payer questions concerning post‐approval product safety and/or effectiveness and a variety of other evidence requirements [26]. RWE studies, when appropriately designed and implemented, offer relatively stronger external validity than RCTs because any participant in a healthcare system has the potential to contribute data, in contrast to RCTs, which represent only the subset of subjects who are willing to participate and meet the often‐strict eligibility criteria [19, 20, 21]. Nonetheless, important limitations of RWD studies arise from the inability to randomize and lack of protocol‐driven data collection.

The lack of protocol‐driven data collection, and the secondary use of data sources often purpose‐built for administrative and billing purposes, means that RWD are generally of lower quality than RCT data, with higher rates of misclassification, mismeasurement, and missingness [27, 28, 29]. In the real world, patient care is delivered across many settings, not all of which may be captured in a single data source, particularly when data are derived from fragmented healthcare systems where no central repository exists. In the US, for example, a patient may switch providers, healthcare plans, or insurers one or more times over the course of several years, leading to a fragmented record of the patient's interaction with a given health system or insurer feeding into a data source. Similarly, specialist interaction datasets may contain rich clinical information on the condition of interest but lack any information about other care received; or a single database may include, for example, outpatient (but not inpatient) medical claims or pharmacy (but not medical) claims.

The inherent inability to assign patients to a treatment group in observational studies introduces potential confounding, because certain patients are channeled towards certain treatments or because data quality differs between the groups of interest. Propensity score methods, which are meant to mimic randomization or to control for measurable confounders, are often used to address this limitation in the analysis of a particular research question, but they may not fully remove the bias [30].

To overcome these limitations of RWD, frameworks such as target trial emulation have been developed to identify a combination of data source, study design, and analytic methods that together are capable of providing robust RWE, or to determine when primary data collection or an interventional study is more appropriate [27, 28, 29, 30, 31].

4. Privacy‐Preserving Record Linkage

To address the issue of fragmented patient health data, methods exist to link an individual's health records across multiple datasets. These methods are often called privacy‐preserving record linkage (PPRL), tokenization, or identity resolution. PPRL works by having data stewards create coded representations of unique individuals using techniques that do not reveal personally identifiable information (PII) like names and addresses [29]. These coded representations, sometimes called “tokens,” enable matching of an individual's records across disparate data sources (such as patient‐level information from interventional studies (RCTs), insurance claims, healthcare systems, laboratory services, and state registries) without risking re‐identification or PII disclosure [29]. Figure 1 presents a simplified diagram of the PPRL workflow. The key benefit is creating comprehensive patient health records without sharing personal details, thus preventing privacy violations and maintaining public trust [27].

Figure 1. Generic Privacy‐Preserving Record Linkage (PPRL) Workflow.

PPRL encompasses techniques adapting to specific data and privacy needs, primarily falling into two fundamental approaches: deterministic and probabilistic linkage [32, 33, 34]. Deterministic PPRL relies on exact matches of identifiers, often employing cryptographic hash functions (e.g., HMAC‐SHA2‐256) to generate irreversible tokens from personal data combinations; a match occurs only if tokens are identical. While offering simplicity and high precision for quality data, this method is sensitive to minor variations or errors, potentially reducing recall. Commercial examples include proprietary tokenization by Datavant and secure hashing by Senzing [35]. Probabilistic PPRL, in contrast, uses statistical methods to assess match likelihood despite discrepancies or missing data, enhancing error tolerance and recall, especially with imperfect real‐world data. A prominent technique involves Bloom filters (BFs), space‐efficient probabilistic structures encoding identifiers into bit arrays via multiple hash functions, enabling similarity estimation (e.g., via Dice/Jaccard indices) between records [36]. Cryptographic Long‐Term Keys (CLKs) represent a specific composite BF implementation [37]. Although effective, BF‐based methods can yield false positives; countermeasures like hardened BFs and differential privacy aim to bolster privacy against cryptanalysis. Advanced cryptographic options like Secure Multiparty Computation (SMC) and Homomorphic Encryption (HE) permit direct computation on encrypted data, providing strong privacy guarantees but typically at higher computational costs [34, 38].
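The two basic approaches can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the method of any vendor or study named above: the secret key, the choice of identifiers, the normalization rule, the bigram encoding, and the Bloom filter parameters (`size`, `num_hashes`) are all hypothetical, and the Bloom filter is unhardened. Deterministic tokens use a keyed hash (HMAC‐SHA‐256) of normalized identifiers; the probabilistic variant encodes character bigrams into bit positions and compares them with the Dice coefficient.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would manage keys securely.
SECRET_KEY = b"shared-secret-for-illustration-only"


def normalize(value: str) -> str:
    """Lower-case and strip all whitespace so trivial variations still match."""
    return "".join(value.lower().split())


def deterministic_token(first: str, last: str, dob: str, key: bytes = SECRET_KEY) -> str:
    """Keyed, irreversible token; two records match only if tokens are identical."""
    message = "|".join(normalize(v) for v in (first, last, dob)).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def bigrams(value: str) -> set[str]:
    """Character bigrams of the padded, normalized string."""
    padded = f"_{normalize(value)}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 1024, num_hashes: int = 4,
                 key: bytes = SECRET_KEY) -> set[int]:
    """Bloom filter represented as the set of bit positions that are set to 1."""
    bits: set[int] = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(key, f"{k}:{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient between two Bloom filters (1.0 means identical bit sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Deterministic linkage: formatting differences are absorbed, spelling errors are not.
same = deterministic_token("Ann", "Smith", "1980-01-31") == deterministic_token(" ann", "SMITH", "1980-01-31")
typo = deterministic_token("Ann", "Smith", "1980-01-31") == deterministic_token("Ann", "Smyth", "1980-01-31")
print(same, typo)  # True False

# Probabilistic linkage: a spelling variant still scores higher than an unrelated name.
print(dice(bloom_encode("smith"), bloom_encode("smyth")))
print(dice(bloom_encode("smith"), bloom_encode("jones")))
```

In this sketch the misspelled surname breaks the deterministic match entirely, whereas the Bloom filter comparison still yields a similarity score that can be compared against a threshold, which is the precision versus recall trade‐off described above.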

PPRL methods offer a significant advantage in privacy protection compared to direct data sharing or linkage using raw protected health information (PHI). By transforming identifying information using the techniques described above (such as irreversible tokens, Bloom filters, or secure multi‐party computation), PPRL minimizes the transfer and exposure of sensitive personal data between data custodians. Alternatives like simple de‐identification (e.g., removal of direct identifiers as per HIPAA Safe Harbor) may still carry a risk of re‐identification through inference or linkage with other available datasets. PPRL, especially when employing advanced techniques and robust security measures, aims to achieve a higher level of privacy preservation.

Beyond the technical aspects and immediate risk of re‐identification, several broader ethical and governance considerations are paramount when linking RCT and RWD via PPRL. Issues of data ownership need careful consideration, particularly regarding who has the right to control and use the linked data derived from different sources. Establishing clear data governance frameworks is crucial to define permissible use of the linked data, ensure accountability, and maintain data security throughout the linkage and analysis process. These frameworks should address aspects like data access controls, data retention policies, and procedures for handling data breaches.

Several initiatives demonstrate the successful implementation of PPRL with strong ethical safeguards. For instance, public health surveillance systems have utilized PPRL to link deidentified records while adhering to strict privacy protocols. Similarly, collaborative research networks have employed PPRL to link patient data across different healthcare organizations for research purposes under established ethical guidelines and data use agreements. These examples often involve transparent consent processes, independent oversight, and rigorous data security measures to protect patient privacy throughout the data linkage and analysis lifecycle.

Several organizations that manage large datasets, known as data custodians, now offer commercially available and centralized “marketplaces” where multiple databases are linked using PPRL. These linked databases serve as a comprehensive data source for health research [39, 40, 41]. However, it is important to note that the quality of these linkages, as measured by validity metrics such as specificity, sensitivity, and positive predictive value, can vary [42, 43]. For example, a study by Mirel et al. (2022) demonstrated that when linking RWD from the 2016 National Hospital Care Survey (NHCS) with the 2016/17 National Death Index (NDI) using PPRL, the precision (specificity) of the linkage ranged from 93.8% to 98.9% depending on the selection of tokens or patient identifiers; the sensitivity ranged between 98.7% and 97.8% [44]. The use of these linked data sources was particularly valuable during the COVID‐19 pandemic, allowing researchers to combine information from various open (i.e., from claims processing centers) and closed (i.e., from healthcare insurers or providers) claims data sources, as well as laboratory and hospital chargemaster data, to gain a more complete understanding of patients' healthcare experiences [45, 46, 47, 48] based on secondary use of data with PPRL. Organizations using PPRL methods must strictly adhere to patient privacy laws to maintain public trust in the research process. Depending on the data sources being linked, obtaining patient consent may be necessary.
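The validity metrics named above can be computed once a linkage has been compared against a reference standard of known true matches. The following sketch uses the standard definitions of sensitivity, specificity, and positive predictive value; the class name `LinkageCounts` and all counts are invented for illustration and are not the figures reported by Mirel et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageCounts:
    """Record-pair counts from comparing a linkage against a reference standard."""
    true_links: int      # pairs linked that are truly the same person (TP)
    false_links: int     # pairs linked that are different people (FP)
    missed_links: int    # true matches the linkage failed to find (FN)
    true_non_links: int  # non-matching pairs correctly left unlinked (TN)


def sensitivity(c: LinkageCounts) -> float:
    """Share of true matches that were linked: TP / (TP + FN)."""
    return c.true_links / (c.true_links + c.missed_links)


def specificity(c: LinkageCounts) -> float:
    """Share of true non-matches left unlinked: TN / (TN + FP)."""
    return c.true_non_links / (c.true_non_links + c.false_links)


def positive_predictive_value(c: LinkageCounts) -> float:
    """Share of reported links that are correct: TP / (TP + FP)."""
    return c.true_links / (c.true_links + c.false_links)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = LinkageCounts(true_links=970, false_links=20, missed_links=30, true_non_links=8980)
print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
print(f"PPV         = {positive_predictive_value(counts):.3f}")
```

Because the number of possible non‐matching pairs is usually very large, specificity tends to sit close to 1 in record linkage, so sensitivity and positive predictive value are often the more informative pair.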

5. Potential Applications of Linked Clinical Trial Data and Real‐World Data

Combining clinical trial data with RWD via PPRL is a potential means of enriching data sources to benefit clinical development programs and researchers, including pharmacoepidemiologists, health economists, and health services researchers. One application is to expand the window into patient health history, increasing the “observability” of trial participants by capturing data collected outside of traditional clinical site visits, which may include information collected at other care delivery sites or prior to or beyond the clinical trial timeline. A second is to enable comparisons of the quality, validity, and reliability of measures taken by site investigators with what is contained in RWD. A third potential application is the creation of large repositories of patient data combining RCT and RWD across a patient's longitudinal health history, such as a disease or treatment registry, as a source for future studies. Some potential applications related to these benefits are described in Table 1, and a comparison of the strengths and weaknesses of RCTs, RWD, and linked data is provided in Table 2.

Table 1.

Potential research opportunities informed through linkage of clinical trial data and real‐world data.

Benefit of PPRL Application of PPRL enabled RCT‐RWD combination Research objectives Rationale for PPRL Examples from the literature
Expanding observability of a patient's interaction with the healthcare system Long‐term follow‐up of clinical trial participants Assessing comparative effectiveness and/or safety; healthcare resource utilization or costs; PROs; or other long‐term outcomes Patient loss to follow‐up and the cost of conducting clinical trials often hinder long‐term outcomes assessment. Using RWD in the same patient population can help to understand the long‐term effects of treatment and minimize the need for direct contact with the participants after the completion of the trial treatment period. The 2020 Janssen ENSEMBLE RCT of the Ad26.COV2.S COVID‐19 vaccine included the “addition of the utilization of tokenization and matching procedures to obtain medical data 5 years before enrollment of the participant until 5 years after the participant completed the study from consenting participants in the US” [49]. In that study, PPRL allowed for extending the observability of the health records before and outside of study sites, enabling additional baseline and follow‐up information to support the clinical development of the vaccine. Furthermore, safety and effectiveness outcomes assessment could be performed outside of the usual trial data collection using patient electronic health or claim records.
Evaluate safety and effectiveness of an interventional treatment relative to a RWD‐based external control arm comparisons Assessing comparative effectiveness and/or safety; healthcare resource utilization or costs; PROs; or other long‐term outcomes within an external control group Comparison of outcomes or other measures of an intervention's safety and effectiveness are impacted by the method of their collection or measurement. Since externally controlled trials mix RCT and RWD, PPRL can allow for comparisons between the RCT and RWD population using the same RWD measure, definitions, reliability, and validity.
Match trial participants to external real‐world data to support externally controlled trials

Comparable measurement of potential confounders

De‐duplication of patient data across multiple external data sources

Increasing statistical power by providing external controls from existing trials or RWD to underpowered studies

As with outcomes, methods for assessing baseline health status at enrollment varies between RCT and RWD populations. PPRL allows both treatment and external control arm to be matched on potential confounders assessed using measures from the same source (RWD).

Especially in the context of rare diseases, external controls may be hard to identify and matching clinical trial participants with such rare cases could potentially match a patient with themselves. Therefore, PPRL could be used to remove the RCT participant from the pool of eligible controls before matching.

In the POSITIVE single‐arm trial (SAT; clinicaltrials.gov number: NCT02308085), Partridge et al. (2023) analyzed data from women in the SAT compared with those in an external control cohort consisting of two previous trials [Suppression of Ovarian Function Trial (SOFT) and Tamoxifen and Exemestane Trial (TEXT)] who would have met the entry criteria for the current trial. The additional clinical trials as a source for external control allowed the researchers to contextualize their SAT findings, but the use of PPRL for unifying different clinical trials was also raised at the National Institute on Aging (NIA)'s “Gaps and Opportunities for Real‐World Data Infrastructure” where they discussed their harmonization of data across more than 2,500 clinical trials and the implications for using PPRL therein [50]. Use of PPRL in this context could remove duplicate patients across trials or ensure they are used as an external control only once, enabling their RCT experience to serve further research in the disease area. The use of placebo‐treated patients as external controls when RCTs are assessed in the same population or indication could increase statistical power for under‐powered RCTs.
Design pragmatic/hybrid clinical trials and to support post‐marketing requirements or commitments Evaluate real‐world safety and effectiveness of an assigned intervention In the real world, care may be fragmented across multiple care systems. Linking multiple sources allows for a more comprehensive view of patient records. If the research objective is specific to a given system/payer, PPRL allows restriction of samples to those covered in the system. This could also reduce trial costs associated with using standard electronic case report forms (eCRFs), reduce the need for data management and complex equipment such as tablets, and open an avenue for novel mechanisms for data collection. The VERVE trial, investigating the safety and effectiveness of varicella zoster vaccine in patients receiving TNF inhibitors, required two in‐person visits and intends pragmatic long‐term follow‐up through linked claims data [51].
Conduct post‐hoc analyses of clinical trial outcomes Identify baseline and time‐varying predictors of safety and efficacy among patients assigned to a treatment intervention Primary data collection studies must consider balance between quality and parsimony during study design. PPRL‐linked RWD can provide information about emergent health trends or variables not considered a priori at time of clinical study design. No examples found.
Study relationships between RCT events, such as patient reported experience measures (PREMs), patient reported outcome measures (PROs), and interim and long‐term real‐world clinical outcomes Assessing efficacy/tolerability sustained after trial (duration of response, less controlled settings) in relation to patient reported metrics PPRL enables the connection of a patient's RCT experience with their RWD. This opens the path for new research questions regarding the patient's experience during the trial and their health outcomes long after the trial has ended. This could identify new associations or generate new hypotheses about the relationship between trial exposures and endpoints or experiences and long‐term patient experiences. The Australian Collaboration for Coordinated Enhanced Sentinel Surveillance of Sexually Transmissible Infections and Blood‐borne Viruses (ACCESS) uses PPRL to connect de‐identified patient records to conduct public health surveillance, specifically targeting blood‐borne viruses and sexually transmissible infections [52].
Understand more about a patient's past, and their generalizability to real‐world populations Transportability, extrapolate efficacy estimate to real world; uses more inputs to project Linked RWD data provides more inputs in addition to information collected in trial that can be used more accurately to project the effects of treatment in the broader indicated population to inform payers, HTA of expected benefit/impact. No examples found.
Address missing data and drop‐outs in RCT data based on RWD

Validating and supplementing baseline characteristics of RCT participants by comparison to pre‐baseline RWD.

Looking at follow‐up RWD to understand drop‐out patients

PPRL can support RCTs to “fill in the gaps” of baseline medical history (e.g., due to recall bias or a lack of knowledge) by examining events occurring in RWD. Additionally, RWD can validate surrogate endpoints measured in the interventional study or fill in gaps to understand missing data or loss to follow‐up. No examples found in clinical trials, but frequently used to combine with primary data collection registries [53].
Comparing RWD and RCT data for the same patients Algorithmic patient identification for trial recruitment, registry inclusion, super/non‐responders, or high‐risk subgroups, and/or under‐surveillance Increasing trial success and improving patient representation Building and testing algorithms (e.g., with a small training data set of case patients and using machine learning or other data science methods) for identifying patients to target for trial recruitment in larger registries or RWD sources such as EHR systems. This could include identifying those patients most likely to respond to treatment, least likely to demonstrate adverse events, or even those who may be un‐ or misdiagnosed and may be missed in recruitment efforts. PPRL allows for use of trial participants' RWD‐based measures at baseline to ensure comparable measurement. Haynes et al. engaged a payer‐stakeholder data repository, the HealthCore Integrated Research Environment, to conduct population‐based patient identification of potential research participants [54] with mail‐based invitations outperforming emailed invitations albeit with modest effectiveness.
Validate baseline demographic characteristics of clinical trial participants in comparison to the same or a similar time period using RWD. Understanding the accuracy and validity of baseline demographics and identifying any discrepancies or biases in the data. PPRL with RWD can provide more comprehensive and up‐to‐date information on demographic characteristics, which may not have been captured in the limited time frame of RCTs. Researchers from the PCORnet further used PPRL to identify the overlap across approximately 170 million patient records and created a de‐duplicated summary of demographic and clinical characteristics for patients with a visit in 2018 or 2019 using data from 61 Network Partners [55].
Validate real‐world algorithms or endpoints with RCT as a “gold standard” How well does algorithm, definition for real‐world variable/endpoint agree with measurement from primary data collection in RCT PPRL can help validate RWD measures or RWD‐based algorithms by comparison with gold standard RCTs. For example, validating proxy algorithms for measures not typically recorded in RWD (e.g., ejection fraction, disease severity or progression of disease), or filling in gaps in RWD for care received during the course of RCT, which are often not observable in RWD. In a study presented by the National Patient‐Centered Clinical Research Network (PCORnet), Haynes et al. assessed the feasibility of linking and using patient‐powered research network (PPRN) membership and health plan data to confirm self‐reported diagnosis [54]. In doing so, they were able to estimate the agreement between patient self‐reported disease status as indicated through PPRN membership and health plan administrative record of disease status.
Identify threats to the validity of trials from potential bias Using the RWD of RCT participants to look for unmeasured confounders in small samples or reduce the impact of recall bias when assessing baseline history Randomization may not be achieved in small samples. PPRL RWD contains more variables than may be assessed at baseline in RCT to inspect for imbalances. The 2020 Janssen ENSEMBLE RCT of the Ad26.COV2.S COVID‐19 vaccine included the “addition of the utilization of tokenization and matching procedures to obtain medical data 5 years before enrollment of the participant” [49]. This extension of the baseline assessment of characteristics allows researchers to reduce the impact of potential participant recall bias and more completely capture their medical history.
Creating new uses for RWD and RCT by combining them Creation of patient registries PROs, PREMs, prospective data linked with claims, EHR, multiple clinical trials Aggregated data across multiple settings enriches information available to researchers, particularly in rare disease settings. The TREAT‐NMD Neuromuscular registry network (TNMD) uses a federated network model to identify patients with spinal muscular atrophy (SMA) to provide data for post‐marketing effectiveness studies, natural history, validation of outcome measures, and clinical trial control arm data [56]. For each of these use cases, TNMD utilizes PPRL to de‐duplicate, capture migration across registries, and combine patient data into a centralized data warehouse (CDW). Furthermore, TNMD received EMA endorsement/support in December 2022 for use in generating RWE in this rare disease population [57].
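The de‐duplication use case in Table 1 (removing trial participants from the pool of eligible external controls before matching) reduces to set operations once every record carries a linkage token. The sketch below assumes tokens have already been generated consistently across sources; the function names, record fields, and token values are hypothetical.

```python
def deduplicate_controls(sources: list[list[dict]]) -> dict[str, dict]:
    """Merge candidate controls from several RWD sources, one record per token."""
    unique: dict[str, dict] = {}
    for source in sources:
        for record in source:
            # The first occurrence of a token wins; later duplicates are ignored.
            unique.setdefault(record["token"], record)
    return unique


def exclude_trial_participants(controls: dict[str, dict],
                               trial_tokens: set[str]) -> dict[str, dict]:
    """Drop anyone enrolled in the trial so a participant cannot be matched to themselves."""
    return {tok: rec for tok, rec in controls.items() if tok not in trial_tokens}


# Hypothetical data: tok_b appears in two sources, tok_c is a trial participant.
claims_source = [{"token": "tok_a", "source": "claims"}, {"token": "tok_b", "source": "claims"}]
ehr_source = [{"token": "tok_b", "source": "ehr"}, {"token": "tok_c", "source": "ehr"}]
trial_tokens = {"tok_c"}

controls = deduplicate_controls([claims_source, ehr_source])
eligible = exclude_trial_participants(controls, trial_tokens)
print(sorted(eligible))  # ['tok_a', 'tok_b']
```

Keeping the first record per token is only one possible rule; a real pipeline would more likely merge the duplicate records than discard them.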

Table 2.

High‐level comparison of RCTs, RWD, and Linked RCT + RWD via PPRL.

| Feature | Randomized controlled trials (RCTs) | Real‐world data (RWD) | Linked RCT + RWD via PPRL |
|---|---|---|---|
| Internal validity | High (due to randomization and controlled conditions) | Lower (susceptible to confounding and selection bias) | Can be high for trial outcomes; external validity of RWD outcomes needs careful assessment |
| External validity | Often limited by strict inclusion/exclusion criteria | Generally higher, reflecting diverse patient populations | Potential to bridge the gap by assessing trial generalizability in RWD |
| Cost | Typically high due to intensive data collection and monitoring | Generally lower for existing data sources | Adds cost for PPRL implementation but can potentially reduce costs in other areas (e.g., longer follow‐up) |
| Timelines | Can be lengthy (recruitment, follow‐up) | Data often readily available (historical data) | Linkage process adds time; can accelerate long‐term follow‐up insights |
| Completeness of data | Typically high for pre‐defined trial variables | Variable; can suffer from missing data and fragmentation | Potential to improve completeness for trial participants (e.g., outcomes, comorbidities) |
| Bias potential | Risk of selection bias in participation, attrition bias | Selection bias in data capture, information bias (misclassification) | Can mitigate some biases (e.g., LTFU) but introduce linkage‐related biases (false positives/negatives) |
| Privacy considerations | Primarily around trial participant data during the trial | Concerns around secondary use and potential re‐identification | Addresses privacy by linking without direct PII sharing; residual risks remain |
| Regulatory use | Gold standard for efficacy and safety | Increasing acceptance for contextual evidence, post‐marketing | Potential to strengthen RWE for regulatory purposes (e.g., ECAs, long‐term safety) |

While linked data sources proved invaluable during the COVID‐19 pandemic by stitching together fragmented patient experiences, the utility of PPRL extends across therapeutic areas and research questions. For instance, PPRL can facilitate the identification of long‐term outcomes that may not be captured within the standard follow‐up period of an RCT. By linking RCT participants to longitudinal RWD sources like electronic health records or administrative claims data, researchers can track the occurrence of events of interest years after the trial concludes. This extended follow‐up can be particularly critical for understanding the durability of treatment effects or identifying rare but serious long‐term adverse events that might not emerge during the controlled trial setting.

Furthermore, PPRL can be instrumental in validating RWD‐derived endpoints against the ‘gold standard’ measurements obtained in RCTs. For example, a study might use PPRL to link patients who participated in a clinical trial where a specific disease severity score was meticulously recorded to their corresponding RWD records. By comparing the trial‐derived score to algorithms or proxy measures for disease severity calculated from the RWD (e.g., based on diagnostic codes, medication use, or healthcare utilization), researchers can assess the accuracy and reliability of these RWD‐based endpoints for use in broader observational studies. Haynes et al. (2023) described an effort to link patient‐powered research network membership data with health plan data using PPRL to confirm self‐reported diagnoses, demonstrating the feasibility of validating RWD elements against patient‐reported information, which can be considered a form of ‘gold standard’ in certain contexts [58]. This process helps establish the fitness‐for‐purpose of RWD for specific research objectives.
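As an illustration of the comparison described above, the following minimal Python sketch summarizes agreement between a binary RWD‐derived proxy and the trial‐recorded status for the same linked participants. The function name `agreement`, the input layout, and the choice of summary measures (sensitivity, specificity, positive predictive value, and Cohen's kappa) are assumptions made for this example; they are not taken from any of the cited studies.

```python
from math import nan


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def agreement(gold: list[bool], proxy: list[bool]) -> dict[str, float]:
    """Compare an RWD-derived proxy with trial-recorded status (hypothetical helper).

    gold[i]  : status recorded in the trial for linked participant i
    proxy[i] : status assigned by the RWD algorithm for the same participant
    """
    if len(gold) != len(proxy) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(gold, proxy))
    tp = sum(1 for g, p in pairs if g and p)          # both positive
    tn = sum(1 for g, p in pairs if not g and not p)  # both negative
    fp = sum(1 for g, p in pairs if not g and p)      # proxy positive only
    fn = sum(1 for g, p in pairs if g and not p)      # trial positive only
    n = len(pairs)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else nan

    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "kappa": kappa,
    }
```

In practice, which measure matters most depends on how the RWD endpoint will be used; for example, a proxy intended for cohort selection may prioritize positive predictive value over sensitivity.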

In the realm of rare diseases, where patient populations are small and RCTs can be challenging, PPRL enables the creation of comprehensive patient registries by linking data from multiple sources, including clinical trials, natural history studies, and electronic health records. The TREAT‐NMD Neuromuscular registry network (TNMD) serves as a prime example, utilizing PPRL to de‐duplicate patient records, track migration across registries, and aggregate data into a centralized data warehouse for spinal muscular atrophy (SMA) [56, 57]. This linked data then supports various research activities, including post‐marketing effectiveness studies and the generation of external control arm data for clinical trials, enhancing the evidence base for regulatory decision‐making in this rare disease population [57].

The use of PPRL can also improve the robustness of externally controlled trials by allowing researchers to remove duplicate patients across different clinical trial datasets used as external controls or to ensure that a patient's RCT experience is only counted once when contributing to the external control arm. The National Institute on Aging (NIA) has discussed the harmonization of data across thousands of clinical trials using PPRL [35], highlighting its potential to ensure the unique contribution of each patient's data in comparative analyses.
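The de‐duplication step can be sketched in a few lines of Python. This is an illustrative example only: it assumes each record already carries a PPRL token produced upstream by a tokenization service, and the field names `token` and `source` are hypothetical.

```python
def deduplicate_by_token(records: list[dict]) -> list[dict]:
    """Keep one record per token so each patient contributes only once.

    Assumes every record has a 'token' key (hypothetical field name) holding
    the privacy-preserving token assigned to that patient. The first record
    encountered for a token is retained; later ones are treated as duplicates.
    """
    first_seen: dict[str, dict] = {}
    for record in records:
        # setdefault stores the record only if the token has not been seen yet.
        first_seen.setdefault(record["token"], record)
    return list(first_seen.values())


# Example: the same patient (token "a1") appears in two trial datasets.
pooled_controls = [
    {"token": "a1", "source": "trial_A"},
    {"token": "b2", "source": "trial_A"},
    {"token": "a1", "source": "trial_B"},
]
unique_controls = deduplicate_by_token(pooled_controls)
```

Keeping the first record is an arbitrary rule chosen for brevity; a real analysis would need a pre‐specified rule for which trial experience to retain.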

6. Challenges in Using the RWD of Clinical Trial Participants Through Privacy‐Preserving Record Linkage

Combining RCT data with RWD through PPRL introduces significant practical and ethical considerations, particularly concerning consent and privacy. Obtaining participant consent before linking these data sources is currently a standard ethical practice for RCT participants [59, 60]. However, given the costs and difficulties associated with recruiting participants for studies, researchers must carefully strategize their approach to obtaining this consent. Factors to consider include the target population, the therapeutic area, and the study's duration to minimize participant attrition. Clinical trial operations teams also need to decide on crucial aspects of the consent process, such as whether to limit its duration, restrict it to specific uses, or obtain it at enrollment or a later visit, and whether participants will actively opt in to or opt out of the tokenization process. Operationally, systems must be in place to ensure that linked data can be separated if a participant withdraws their consent.

Even when data from RCTs and RWD are anonymized or pseudonymized, linking these two sources introduces a potential risk of re‐identifying individuals. To mitigate this risk, a process of privacy recertification of the linked data set may be required. This recertification could necessitate omitting certain key variables, such as race, geographic location, or information about rare events. Therefore, the research team and the entity responsible for recertification must collaboratively decide which variables can be omitted while still supporting the research aims and adequately protecting patient privacy.

A critical challenge in utilizing RWD, and consequently in linking it with RCT data via PPRL, lies in the inherent data quality issues prevalent in many real‐world sources. RWD, derived from electronic health records, administrative claims, patient registries, and wearable devices, can suffer from missing data, misclassification of diagnoses and procedures, and inconsistencies in recording practices. These underlying data quality problems are not inherently resolved by the PPRL process and can significantly impact the validity of analyses conducted on the linked data set.

Furthermore, the linkage process itself can introduce or exacerbate biases. Selection bias may arise if PPRL fails to link certain trial participants to their RWD records, and these unlinked individuals differ systematically from those who are successfully linked. For example, differential linkage rates based on demographic factors or the quality of available personally identifiable information (PII) could lead to a linked cohort that is not representative of the original trial population. Information bias, including misclassification, can be introduced through both false positive linkages (incorrectly linking records of different individuals, leading to erroneous covariate or outcome data) and false negative linkages (failing to link records of the same individual, resulting in incomplete data). The impact of confounding may also be affected; while linkage to richer RWD can provide more covariates for adjustment, linkage errors could lead to inaccurate control for confounders if linkage quality varies across levels of these confounders.
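One simple way to screen for the differential linkage described above is to tabulate linkage rates by subgroup. The sketch below is a hypothetical illustration; the function name `linkage_rate_by_group` and the input layout (pairs of a subgroup label and a linked flag) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def linkage_rate_by_group(
    participants: Iterable[tuple[Hashable, bool]],
) -> dict[Hashable, float]:
    """Return the proportion of trial participants linked to RWD in each subgroup.

    Each element of `participants` is (group_label, was_linked), for example
    ("65+", True). Large gaps between subgroups suggest that the linked
    cohort may not represent the original trial population.
    """
    totals: dict[Hashable, int] = defaultdict(int)
    linked: dict[Hashable, int] = defaultdict(int)
    for group, was_linked in participants:
        totals[group] += 1
        linked[group] += int(was_linked)
    return {group: linked[group] / totals[group] for group in totals}


# Example with two age bands (illustrative values, not real data).
rates = linkage_rate_by_group([
    ("<65", True), ("<65", True), ("<65", False),
    ("65+", True), ("65+", False), ("65+", False), ("65+", False),
])
```

A descriptive table like this does not by itself establish bias, but it indicates where unlinked participants may differ systematically from linked ones.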

To ensure the reliability of linked datasets, robust validation frameworks are essential [55, 61, 62]. This involves comparing the links generated by PPRL algorithms against a ‘gold standard’ where true matches are known. Common methods for establishing a gold standard include manual review of a subset of potential links or deterministic linkage using unique identifiers when available. The performance of PPRL is then evaluated using metrics such as precision (the proportion of declared matches that are correct), recall (the proportion of true matches that are identified), and the F1‐score (the harmonic mean of precision and recall). Researchers must transparently report these validation metrics, along with details on the data sources, linkage methods, and any limitations, to allow for critical appraisal of the potential for bias in the linked data.
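The metrics defined above can be computed directly once a gold‐standard set of true matches is available. The following is a minimal sketch, assuming that both the declared links and the gold standard are represented as sets of (trial record ID, RWD record ID) pairs; the function name `linkage_metrics` and the identifiers in the example are hypothetical.

```python
def linkage_metrics(
    declared: set[tuple[str, str]],
    truth: set[tuple[str, str]],
) -> dict[str, float]:
    """Precision, recall, and F1 of declared links against a gold standard.

    declared : record pairs the PPRL process declared as matches
    truth    : record pairs known to be true matches (gold standard)
    """
    true_positives = len(declared & truth)  # declared links that are correct
    precision = true_positives / len(declared) if declared else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: 4 declared links, 3 of them correct, out of 5 true matches.
declared_links = {("t1", "r1"), ("t2", "r2"), ("t3", "r3"), ("t4", "r9")}
true_links = {("t1", "r1"), ("t2", "r2"), ("t3", "r3"), ("t5", "r5"), ("t6", "r6")}
metrics = linkage_metrics(declared_links, true_links)
```

In this example, precision is 3/4 and recall is 3/5. False positive links lower precision, and false negative links lower recall, which corresponds to the two sources of information bias discussed earlier.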

The potential applications of linked clinical trial data and RWD to support regulatory decision‐making merit special consideration, but they are challenging to identify because the use of PPRL is a relatively new and evolving area. Beyond the considerations that apply to non‐randomized externally controlled trials, assessing the appropriateness of study designs that include RCT data and RWD linked via PPRL requires additional rigor when the studies are intended to inform regulatory decision‐making; similar considerations apply in non‐regulatory settings, as appropriate for the level of evidence required.

The challenges of determining and documenting the relevance of the data sources and data capture for a given objective increase if the PPRL‐combined data set is derived from multiple sources given the numerous considerations for even a single data source. The FDA guidance on reliability and relevance stipulates that a sponsor should describe each data source, the information obtained, linkage methods, accuracy, and completeness of record linkages over time.

Similarly, multiple record linkages increase the challenges associated with demonstrating the provenance, quality, and completeness of the RWD sources with respect to key data elements such as inclusion criteria, exposure(s), key covariates, outcome(s) of interest, and other important parameters relevant to the study question and design. For studies combining data from multiple data sources the FDA further recommends demonstrating whether and how data from different sources can be obtained and integrated with acceptable quality, given the potential for heterogeneity in population characteristics, clinical practices, and coding across data sources [7].

Geographic representation poses a particular challenge when relying on PPRL of clinical trial participants. PPRL services are more prominent in the US, perhaps due to fewer privacy regulations and the lack of a national health identifier that permits patient follow‐up over the life course. One data custodian recently announced plans to offer PPRL services in the UK [40]. In 2021, the European Parliamentary Research Service (EPRS) specifically named PPRL as a component of a centralized governance structure for an EU‐wide health care data set but noted that it must not bypass citizen participation, which is needed to ensure accountability to citizens and to secure their participatory and legitimizing democratic support. PPRL adoption in the EU may be delayed until this process takes place. In addition, despite methodologic improvements currently being developed specifically for a European context [63], we could not identify specific data custodians offering PPRL services in Europe. The feasibility of PPRL of RCT and RWD will vary within and across countries depending on laws and cultural attitudes towards privacy, informed consent, and health research.

The practical implementation of PPRL for linking RCT and RWD presents several technical and logistical challenges. Technical feasibility involves considerations such as the computational resources required for token generation and record matching, particularly when dealing with large datasets. The scalability of PPRL methods is crucial for handling the increasing volumes of RWD and the need to link across multiple sources. Achieving interoperability between different data systems with varying data formats, terminologies, and standards is a significant hurdle that necessitates data standardization and harmonization efforts before linkage.

The cost‐effectiveness of PPRL must be considered in relation to the potential benefits and alternatives. While there are costs associated with engaging PPRL vendors, implementing necessary infrastructure, and managing the linkage process, these expenses may be offset by the potential savings from reduced reliance on costly primary data collection, extended trial follow‐up without direct patient contact, and the generation of more comprehensive evidence to support regulatory and payer decisions. For example, leveraging RWD as external control arms in certain scenarios could reduce the sample size and duration of RCTs, leading to significant cost savings.

Legal constraints extend beyond the primary concern of patient privacy regulations like GDPR and HIPAA. Data use agreements between the RCT sponsor and RWD custodians, as well as approvals from Institutional Review Boards (IRBs) or ethics committees, are essential legal prerequisites for any data linkage initiative. These agreements must clearly define the purpose of the linkage, the data elements to be used, the privacy safeguards in place, and the permitted analyses of the linked data. Navigating the complexities of these legal and administrative hurdles across different data custodians and jurisdictions can be a significant implementation challenge.

Furthermore, differing data privacy regulations across jurisdictions create practical barriers to PPRL implementation. For instance, the General Data Protection Regulation (GDPR) in the European Union generally requires explicit consent for processing sensitive health data, including for research purposes. This stricter consent requirement can impact the feasibility of linking data, especially for retrospective studies or when explicit consent for linkage was not obtained during the initial data collection. In contrast, HIPAA in the United States provides pathways for using PII for research under certain conditions, including de‐identification standards and waivers of authorization approved by IRBs. These regulatory differences necessitate careful consideration of legal and ethical frameworks when planning and executing PPRL projects across different regions.

The successful adoption of PPRL also relies heavily on maintaining the “social license” – the public's trust and acceptance of large‐scale data linkage initiatives for research purposes [64]. Transparency regarding the purpose of data linkage, the privacy‐preserving techniques employed, and the safeguards in place to prevent re‐identification and misuse is crucial for fostering this trust. Any perceived breaches of privacy or lack of transparency can erode public support and hinder the broader adoption of PPRL in health research. Ensuring accountability to citizens and addressing participatory democratic support, as noted by the EPRS, is vital for building and sustaining this social license, particularly in regions with a strong emphasis on data privacy rights.

While empirical data from surveys or interviews could provide valuable insights into public perceptions and the practical challenges of implementation, such primary data collection falls outside the scope of this narrative review. However, these aspects represent important areas for future research and consideration in the broader field of PPRL application.

While PPRL offers numerous potential benefits for linking RCT and RWD, it is crucial to acknowledge its limitations and scenarios where it may not be the optimal approach or could introduce complexities. PPRL is fundamentally dependent on the quality and completeness of the PII used for linkage; if these are poor in either the RCT or RWD source, linkage accuracy will be compromised, potentially leading to biased results. In situations where the RWD source lacks the key variables needed for a specific research question or where the overlap between the trial population and the RWD source is minimal, the value of linkage will be limited. Furthermore, if the quality of linkage is poor or differential between exposure groups, PPRL could inadvertently increase confounding by introducing misclassified data into the analysis. Documented challenges in PPRL studies include instances of lower‐than‐expected linkage rates in certain subpopulations due to data quality issues or variations in naming conventions. Therefore, researchers must carefully assess the suitability of PPRL for their specific research objectives and data sources, and transparently report any limitations encountered.

A final consideration is the extent to which PPRL studies require additional assessment and quantification of potential selection biases in this setting. To this end, researchers should compare characteristics of trial participants who do and do not consent to linkage. Further research into the challenges of linking RWD and RCT data through PPRL and documentation of the lessons learned from researchers performing such linkage will advance science as well as future studies combining RCT and RWD.
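As one possible way to make this comparison concrete, the sketch below computes a standardized mean difference for a continuous baseline characteristic (for example, age) between participants who did and did not consent to linkage. The function name and the example values are hypothetical, and the standardized mean difference is only one of several reasonable summary measures.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(
    consented: list[float],
    not_consented: list[float],
) -> float:
    """Standardized mean difference for one continuous baseline variable.

    Uses the pooled standard deviation sqrt((var_a + var_b) / 2), where each
    variance is the sample variance. Each group needs at least two values.
    """
    pooled_sd = sqrt((variance(consented) + variance(not_consented)) / 2)
    if pooled_sd == 0:
        raise ValueError("no variation in either group; SMD is undefined")
    return (mean(consented) - mean(not_consented)) / pooled_sd


# Example: baseline age in each group (illustrative values, not real data).
age_consented = [60.0, 62.0, 64.0, 66.0, 68.0]
age_not_consented = [50.0, 52.0, 54.0, 56.0, 58.0]
smd_age = standardized_mean_difference(age_consented, age_not_consented)
```

An absolute value above roughly 0.1 is often treated as a sign of meaningful imbalance, although this threshold is a convention rather than a formal rule.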

7. Conclusion

While the process to combine RCT and RWD sources on the same individuals is well‐accepted and easily accomplished, there remains unrealized potential to add significant value to drug, vaccine, medical device, and diagnostics development by lowering costs, increasing the availability of more comprehensive patient data, and advancing approaches and methods for evidence generation. The data collection methods for RCTs and RWE (for example, primary versus secondary data collection) drive measurement characteristics, including sensitivity, specificity, missingness of observations, mean difference, and onset timing, that often limit evidence generation and decision‐making. Linking RCT participants' data with their RWD may solve problems that neither RWD nor RCTs alone could overcome. Further consideration of PPRL use cases in clinical development may drive innovative trial designs augmented with RWD and vice versa, improving the ability of these collected data to support informed decision‐making and clinical development across stakeholders.

Author Contributions

Michael Batech: conceptualization, methodology, writing – original draft. Ann Madsen: conceptualization, writing – review and editing, supervision. Nicolle Gatto: methodology, conceptualization, supervision, writing – review and editing. Tancy C. Zhang: methodology, writing – original draft, writing – review and editing. Deborah Ricci: conceptualization, writing – review and editing. Raymond Harvey: conceptualization, methodology, writing – review and editing, supervision. Najat Khan: funding acquisition, writing – review and editing. Sid Jain: writing – review and editing, conceptualization, resources.

Ethics Statement

This article is a narrative review and does not contain any studies with human participants or animals performed by any of the authors.

Conflicts of Interest

Dr. Michael Batech, Dr. Ann Madsen, Dr. Nicolle M. Gatto, and Mrs. Tancy C. Zhang are employees of Aetion Inc. and own stock options or equity in Aetion. Dr. Deborah Ricci, Mr. Ray Harvey, Dr. Najat Khan, and Mr. Sid Jain are employees and shareholders of Johnson & Johnson. No honoraria were made for authorship. The lead author, Dr. Michael Batech, affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted.

Acknowledgments

We would like to acknowledge and thank Douglas J Watson, PhD, of Epi Excellence LLC and PharmaEpi Consulting LLC for initial medical writing support and thought leadership, which helped shape the direction of our work; Sebastian Schneeweiss for insightful thought leadership and perspective on PPRL, especially in overcoming limitations in RWE generation; and Alexa Rubens, Jennifer Thorburn, and Hannah Kreisberg from Aetion for their support in generating real‐world evidence using PPRL linked data sets, which provided valuable lessons learned early on in the development of this manuscript. This study was funded by Janssen Research and Development. The funding organization collaborated in the design and conduct of the study, the preparation and review of the manuscript, and the decision to submit the manuscript for publication. The funder had no role in the collection, management, analysis, and interpretation of the data.

Batech M., Madsen A., Gatto N., et al., “Combining Real‐World and Clinical Trial Data Through Privacy‐Preserving Record Linkage: Opportunities and Challenges—A Narrative Review,” Health Science Reports 8 (2025): 1‐15, 10.1002/hsr2.71272.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study. This manuscript is a narrative review and did not involve the collection or analysis of primary research data. All information and claims presented in this review are based on the publicly available literature cited in the References section.

References

  • 1. Gabay M., “21st Century Cures Act,” Hospital Pharmacy 52, no. 4 (2017): 264–265.
  • 2. Hudson K. L. and Collins F. S., “The 21st Century Cures Act—A View From the NIH,” New England Journal of Medicine 376, no. 2 (2017): 111–113.
  • 3. Makady A., de Boer A., Hillege H., Klungel O., and Goettsch W., “What Is Real‐World Data? A Review of Definitions Based on Literature and Stakeholder Interviews,” Value in Health 20, no. 7 (2017): 858–865.
  • 4. Purpura C. A., Garry E. M., Honig N., Case A., and Rassen J. A., “The Role of Real‐World Evidence in FDA‐Approved New Drug and Biologics License Applications,” Clinical Pharmacology and Therapeutics 111, no. 1 (2022): 135–144.
  • 5. U.S. Food and Drug Administration, Framework for FDA's Real‐World Evidence Program. Published December 2018.
  • 6. Cucherat M., Laporte S., Delaitre O., et al., “From Single‐Arm Studies to Externally Controlled Studies. Methodological Considerations and Guidelines,” Therapies 75, no. 1 (2020): 21–27.
  • 7. U.S. Food and Drug Administration, Real‐World Data: Assessing Registries to Support Regulatory Decision‐Making for Drug and Biological Products (Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, and Oncology Center of Excellence, FDA, 2023).
  • 8. Anes A. M., Arana A., Blake K., et al., The European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP). Guide on Methodological Standards in Pharmacoepidemiology (revision 1, 2012, revision 2, 2013, revision 3, 2014). 2012.
  • 9. EMA. Guideline on Registry‐based Studies. European Medicines Agency, Amsterdam, 2021.
  • 10. National Institute for Health and Care Excellence, NICE Real‐World Evidence Framework: Corporate Document [ECD9]. 2022.
  • 11. Munnangi S. and Boktor S. W. Epidemiology of Study Design. 2017.
  • 12. Cartwright N., “A Philosopher's View of the Long Road From RCTs to Effectiveness,” Lancet 377, no. 9775 (2011): 1400–1401.
  • 13. Fogel D. B., “Factors Associated With Clinical Trials That Fail and Opportunities for Improving the Likelihood of Success: A Review,” Contemporary Clinical Trials Communications 11 (2018): 156–164.
  • 14. Kadam R., Borde S., Madas S., Salvi S., and Limaye S., “Challenges in Recruitment and Retention of Clinical Trial Subjects,” Perspectives in Clinical Research 7, no. 3 (2016): 137–143.
  • 15. Nipp R. D., Hong K., and Paskett E. D., editors. Overcoming Barriers to Clinical Trial Enrollment. American Society of Clinical Oncology Educational Book American Society of Clinical Oncology Annual Meeting; 2019.
  • 16. Akl E. A., Briel M., You J. J., et al., “Potential Impact on Estimated Treatment Effects of Information Lost to Follow‐up in Randomised Controlled Trials (LOST‐IT): Systematic Review,” BMJ 18 (2012): 344.
  • 17. Frieden T. R., “Evidence for Health Decision Making—Beyond Randomized, Controlled Trials,” New England Journal of Medicine 377, no. 5 (2017): 465–475.
  • 18. Higgins J. P., Savović J., Page M. J., Elbers R. G., and Sterne J. A., “Assessing Risk of Bias in a Randomized Trial,” Cochrane Handbook for Systematic Reviews of Interventions 1 (2019): 205–228.
  • 19. Rothwell P. M., “External Validity of Randomised Controlled Trials: ‘To Whom Do the Results of This Trial Apply?’,” The Lancet 365, no. 9453 (2005): 82–93.
  • 20. Bothwell L. E., Greene J. A., Podolsky S. H., and Jones D. S., “Assessing the Gold Standard—Lessons From the History of RCTs,” New England Journal of Medicine 374, no. 22 (2016): 2175–2181.
  • 21. Chavez‐MacGregor M. and Giordano S. H. Randomized Clinical Trials and Observational Studies: Is There a Battle?: American Society of Clinical Oncology; 2016. p. 772–773.
  • 22. Alphs L., Mao L., Lynn Starr H., and Benson C., “A Pragmatic Analysis Comparing Once‐Monthly Paliperidone Palmitate Versus Daily Oral Antipsychotic Treatment in Patients With Schizophrenia,” Schizophrenia Research 170, no. 2–3 (2016): 259–264.
  • 23. Cohen S. B., Greenberg J. D., Harnett J., et al., “Real‐World Evidence to Contextualize Clinical Trial Results and Inform Regulatory Decisions: Tofacitinib Modified‐Release Once‐Daily vs. Immediate‐Release Twice‐Daily for Rheumatoid Arthritis,” Advances in Therapy 38 (2021): 226–248.
  • 24. Mills M. and Kanavos P., “How do HTA Agencies Perceive Conditional Approval of Medicines? Evidence From England, Scotland, France and Canada,” Health Policy 126, no. 11 (2022): 1130–1143.
  • 25. Dagenais S., Russo L., Madsen A., Webster J., and Becnel L., “Use of Real‐World Evidence to Drive Drug Development Strategy and Inform Clinical Trial Design,” Clinical Pharmacology and Therapeutics 111, no. 1 (2022): 77–89.
  • 26. Suvarna V., “Phase IV of Drug Development,” Perspectives in Clinical Research 1, no. 2 (2010): 57–60.
  • 27. Borfitz D., “The Case for Tokenizing Data on Clinical Trial Participants,” Clinical Research News, 2022.
  • 28. Hernán M. A. and Robins J. M., “Using Big Data to Emulate a Target Trial When a Randomized Trial is Not Available,” American Journal of Epidemiology 183, no. 8 (2016): 758–764.
  • 29. Vaiwsri S., Ranbaduge T., Christen P., and Schnell R., “Accurate Privacy‐Preserving Record Linkage for Databases With Missing Values,” Information Systems 106 (2022): 101959.
  • 30. Guo S., Fraser M., and Chen Q., “Propensity Score Analysis: Recent Debate and Discussion,” Journal of the Society for Social Work and Research 11, no. 3 (2020): 463–482.
  • 31. Gatto N. M., Campbell U. B., Rubinstein E., et al., “The Structured Process to Identify Fit‐For‐Purpose Data: A Data Feasibility Assessment Framework,” Clinical Pharmacology and Therapeutics 111, no. 1 (2022): 122–134.
  • 32. Hayn D., Kreiner K., Sandner E., et al., “Use Cases Requiring Privacy‐Preserving Record Linkage in Paediatric Oncology,” Cancers 16, no. 15 (2024): 2696.
  • 33. Pathak A., Serrer L., Zapata D., et al., “Privacy Preserving Record Linkage for Public Health Action: Opportunities and Challenges,” Journal of the American Medical Informatics Association 31, no. 11 (2024): 2605–2612.
  • 34. Zhu Y., Matsuyama Y., Ohashi Y., and Setoguchi S., “When to Conduct Probabilistic Linkage vs. Deterministic Linkage? A Simulation Study,” Journal of Biomedical Informatics 56 (2015): 80–86.
  • 35. National Institute on Aging. Privacy Preserving Record Linkage (PPRL) Strategy and Recommendations. (2023), https://www.nia.nih.gov/sites/default/files/2023-08/pprl-linkage-strategies-preliminary-report.pdf.
  • 36. Schnell R., Richter A., and Borgs C., “A Comparison of Statistical Linkage Keys With Bloom Filter‐based Encryptions for Privacy‐preserving Record Linkage Using Real‐world Mammography Data,” Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies (Porto, Portugal: SCITEPRESS ‐ Science and Technology Publications, 2017), 276–283, http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0006140302760283.
  • 37. Brown A. P., Borgs C., Randall S. M., and Schnell R., “Evaluating Privacy‐Preserving Record Linkage Using Cryptographic Long‐Term Keys and Multibit Trees on Large Medical Datasets,” BMC Medical Informatics and Decision Making 17, no. 1 (2017): 83.
  • 38. Mağara Ş. S., Dietrich N., Ünal A. B., and Akgün M. Accelerating Privacy‐Preserving Medical Record Linkage: A Three‐Party MPC Approach. arXiv; 2024. [cited 2025 April 18], https://arxiv.org/abs/2410.21605.
  • 39.datavant. Matching Patients Across Healthcare Databases, https://www.datavant.com/white-papers/matching-patients-across-healthcare-databases.
  • 40.datavant. Datavant Launches International Expansion, Acquires Convenet to Build Trial Tokenization in the UK 2022, https://www.datavant.com/press-release/datavant-launches-international-expansion-acquires-convenet-to-build-trial-tokenization-in-the-uk.
  • 41.HealthVerity. HealthVerity Census De‐identification and Identity Matching Software, https://healthverity.com/solutions–trashed/healthverity-census/.
  • 42. Bernstam E. V., Applegate R. J., Yu A., et al., “Real‐World Matching Performance of Deidentified Record‐Linking Tokens,” Applied Clinical Informatics 13, no. 4 (2022): 865–873.
  • 43. Eckrote M. J., Nielson C. M., Lu M., et al., “Linking Clinical Trial Participants to Their U.S. Real‐World Data Through Tokenization: A Practical Guide,” Contemporary Clinical Trials Communications 41 (2024): 101354.
  • 44. Mirel L. B., Resnick D. M., Aram J., and Cox C. S., “A Methodological Assessment of Privacy Preserving Record Linkage Using Survey and Administrative Data,” Statistical Journal of the IAOS 38, no. 2 (2022): 413–421.
  • 45. Burn E., Sena A. G., Prats‐Uribe A., et al., Use of Dialysis, Tracheostomy, and Extracorporeal Membrane Oxygenation Among 842,928 Patients Hospitalized With COVID‐19 in the United States. medRxiv. 2021.
  • 46. Harvey R. A., Rassen J. A., Kabelac C. A., et al., “Association of SARS‐CoV‐2 Seropositive Antibody Test With Risk of Future Infection,” JAMA Internal Medicine 181, no. 5 (2021): 672–679.
  • 47. Murk W., Gierada M., Fralick M., Weckstein A., Klesh R., and Rassen J. A., “Diagnosis‐Wide Analysis of COVID‐19 Complications: An Exposure‐Crossover Study,” Canadian Medical Association Journal 193, no. 1 (2021): E10–E18.
  • 48. Stewart M., Rodriguez‐Watson C., Albayrak A., et al., “COVID‐19 Evidence Accelerator: A Parallel Analysis to Describe the Use of Hydroxychloroquine With or Without Azithromycin Among Hospitalized COVID‐19 Patients,” PLoS One 16, no. 3 (2021): e0248128.
  • 49. Vaccines J., A Study of Ad26.COV2.S for the Prevention of SARS‐CoV‐2‐Mediated COVID‐19 in Adult Participants (ENSEMBLE). 2022. ClinicalTrials.gov/ct2/show/study/NCT04505722.
  • 50. National Institute on Aging (NIA). Gaps and Opportunities for Real‐World Data Infrastructure 2022.
  • 51. Curtis J. R., Foster P. J., and Saag K. G., “Tools and Methods for Real‐World Evidence Generation,” Rheumatic Disease Clinics of North America 45, no. 2 (2019): 275–289.
  • 52. Nguyen L., Stoové M., Boyle D., et al., “Privacy‐Preserving Record Linkage of Deidentified Records Within a Public Health Surveillance System: Evaluation Study,” Journal of Medical Internet Research 22, no. 6 (2020): e16757.
  • 53. Lecluse L. L. A., Naldi L., Stern R. S., and Spuls P. I., “National Registries of Systemic Treatment for Psoriasis and the European ‘Psonet’ Initiative,” Dermatology 218, no. 4 (2009): 347–356.
  • 54. Haynes K., Agiro A., Chen X., et al., Developing Methods to Link Patient Records Across Data Sets That Preserve Patient Privacy. 2023.
  • 55. Marsolo K., Kiernan D., Toh S., et al., “Assessing the Impact of Privacy‐Preserving Record Linkage on Record Overlap and Patient Demographic and Clinical Characteristics in PCORnet®, The National Patient‐Centered Clinical Research Network,” Journal of the American Medical Informatics Association 30, no. 3 (2023): 447–455.
  • 56. European Medicines Agency. TREAT‐NMD and EMA Registry Qualification.
  • 57. TREAT‐NMD Services Limited. TREAT‐NMD Core Dataset for SMA. TREAT‐NMD Neuromuscular Network.
  • 58. Haynes K., Agiro A., Chen X., et al., Developing Methods to Link Patient Records Across Data Sets That Preserve Patient Privacy [Internet] (Washington (DC): Patient‐Centered Outcomes Research Institute (PCORI), 2020). (PCORI Final Research Reports), http://www.ncbi.nlm.nih.gov/books/NBK593580/.
  • 59. United States National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research (Department of Health, Education, and Welfare, 1978).
  • 60. Miracle V. A., “The Belmont Report: The Triple Crown of Research Ethics,” Dimensions of Critical Care Nursing 35, no. 4 (2016): 223–228.
  • 61. Tachinardi U., Grannis S. J., Michael S. G., et al., “Privacy‐Preserving Record Linkage Across Disparate Institutions and Datasets to Enable a Learning Health System: The National COVID Cohort Collaborative (N3C) Experience,” Learning Health Systems 8, no. 1 (2024): e10404.
  • 62. Kiernan D., Carton T., Toh S., et al., “Establishing a Framework for Privacy‐Preserving Record Linkage Among Electronic Health Record and Administrative Claims Databases Within PCORnet®, The National Patient‐Centered Clinical Research Network,” BMC Research Notes 15, no. 1 (2022): 337.
  • 63. Laud P. and Pankova A., “Privacy‐Preserving Record Linkage in Large Databases Using Secure Multiparty Computation,” BMC Medical Genomics 11 (2018): 84.
  • 64. Muller S. H. A., Kalkman S., van Thiel G. J. M. W., Mostert M., and Van Delden J. J. M., “The Social Licence for Data‐Intensive Health Research: Towards Co‐Creation, Public Value and Trust,” BMC Medical Ethics 22, no. 1 (2021): 110.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study. This manuscript is a narrative review and did not involve the collection or analysis of primary research data. All information and claims presented in this review are based on the publicly available literature cited in the References section.


Articles from Health Science Reports are provided here courtesy of Wiley