2023 Apr 5;25:e43484. doi: 10.2196/43484

Table 1.

Summary of challenges, examples, and application of solutions in real-world data identification for comparative efficacy using externally controlled trials.

Challenges | Examples | Application of solutions
Data source identification of rare conditions

Challenge: Indecision gaps due to abundance of real-world data
Example: It is difficult to parse out important data sources, rare disease candidates, and data linkage options.
Application of solutions: Machine learning applications can improve accuracy and quality (type and frequency) in data source selection and patient selection [23,67,71]
Outcome and covariate


Challenge: Limited comparability of real-world data due to poorly defined variables or definitions that are inconsistent between the clinical trial and real-world data
Example: The conceptual definition of a data element does not align with the operational definition.
Application of solutions:
  • De novo data collection [74]
  • Automated electronic health record (EHR) abstraction [69]
  • Characterization of real-world variables or surrogate endpoints [75]
  • Prespecify sensitivity analyses, including quantitative bias analyses [76], in the statistical analysis plan


Challenge: Medical claims data might have limited use to support regulatory-grade decision-making
Example: Claims data have limited clinical outcome data.
Application of solutions: Combine with EHRs to expand the applicability, coverage, and depth of data [77]
Follow-up


Challenge: Difficult to capture continuity of care in a single data source
Example: Diagnosis is spread across multiple physicians; if the patient moves and seeks care outside of the care network, follow-up data will be lost.
Application of solutions:
  • Tokenization/data linkage and advanced analytics with EHR data for capturing a more complete patient journey (particularly helpful for rare conditions where the sample size would be low) [23,44,71]
  • Analytical approaches (ie, imputation) for missing data [23,44,71,78]

Time selection


Challenge: Timing of therapy
Example: Patient has multiple lines of treatment; what should be considered the index date?
Application of solutions: Define a proper index date or “time zero” following the target trial emulation framework [52]

Challenge: Timing of data collection: inconsistent standard of care over time
Example: Data may be present, but are not current enough to provide a reasonable comparison to the current standard of care.
Application of solutions:
  • De novo data collection [55]
  • Tokenization/data linkage [78]

Geography


Challenge: External control arm nongeneralizable to clinical practice
Example: Geographic representation where the main external control arm data source is from outside of the country of interest. Select two unlinked data sources with available data to obtain a sufficient sample size. However, it is unclear if patients overlap in care networks.
Application of solutions:
  • Tokenization/data linkage, which improves patient counts with geographic representation while accounting for duplicates [79]
  • Transportability [60]

Analysis phase


Challenge: Data loss or a sample size insufficient to detect an effect
Example: In the analysis phase, during matching, the power to detect an effect is reduced.
Application of solutions:
  • De novo data collection [55]
  • Tokenization/data linkage [23,44]
  • Analytical approaches (ie, imputation) for missing data [78]


Challenge: Avoid the appearance that the analysis is post hoc or cherry-picked
Example: Data dredging/post-hoc analysis (eg, regulators can assume the most appealing analysis was conducted).
Application of solutions: Transparent, prespecified description of data selection, data provenance, and the statistical analysis plan [3]
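
Several rows of Table 1 name tokenization/data linkage as a way to follow patients across data sources and to account for duplicates when two sources are combined. The Python sketch below illustrates the general idea only. It assumes a keyed hash (HMAC-SHA-256) over normalized identifiers as the token, which is one common approach and not necessarily the method used in the cited studies. The field names (`first_name`, `last_name`, `birth_date`, `visits`), the key, and the sample records are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real linkage service would manage keys securely.
SECRET_KEY = b"example-key-not-for-production"


def make_token(first_name, last_name, birth_date):
    """Return a keyed hash of normalized identifiers (illustrative scheme only)."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_sources(source_a, source_b):
    """Merge two lists of patient records, keeping one entry per token."""
    linked = {}
    for source_name, records in (("A", source_a), ("B", source_b)):
        for record in records:
            token = make_token(
                record["first_name"], record["last_name"], record["birth_date"]
            )
            entry = linked.setdefault(token, {"sources": [], "visits": []})
            entry["sources"].append(source_name)
            entry["visits"].extend(record["visits"])
    return linked


# Made-up records: the first patient appears in both sources with
# different capitalization and spacing.
source_a = [
    {"first_name": "Ana", "last_name": "Lopez", "birth_date": "1980-02-01",
     "visits": ["2021-01-10"]},
    {"first_name": "Ben", "last_name": "Kim", "birth_date": "1975-07-15",
     "visits": ["2021-03-05"]},
]
source_b = [
    {"first_name": " ana ", "last_name": "LOPEZ", "birth_date": "1980-02-01",
     "visits": ["2022-06-20"]},
    {"first_name": "Cara", "last_name": "Osei", "birth_date": "1990-11-30",
     "visits": ["2022-08-02"]},
]

linked = link_sources(source_a, source_b)
unique_patients = len(linked)
overlapping = sum(1 for entry in linked.values() if len(set(entry["sources"])) > 1)
print(unique_patients, overlapping)  # prints "3 1"
```

In this toy example, four records collapse to three unique patients, and the overlapping patient's visits from both sources are combined into one longer record. Exact-match hashing of this kind misses typos and name changes, so practical linkage typically needs more robust matching and governance than shown here.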