2023 Apr 5;25:e43484. doi: 10.2196/43484

Table 1.

Summary of challenges, examples, and application of solutions in identifying real-world data for comparative efficacy using externally controlled trials.

Challenges			Examples		Application of solutions
Data source identification of rare conditions
	Indecision gaps due to abundance of real-world data	It is difficult to parse out important data sources, rare disease candidates, and data linkage options.		Machine learning applications can improve accuracy and quality (type and frequency) in data source selection and patient selection [23,67,71]
Outcome and covariate
	Poorly defined variables or inconsistent definitions between clinical trial and real-world data, limiting the comparability of real-world data	The conceptual definition of a data element does not align with the operational definition.		De novo data collection [74]; automated electronic health record (EHR) abstraction [69]; characterization of real-world variables or surrogate endpoints [75]; prespecified sensitivity analyses, including quantitative bias analyses [76], in the statistical analysis plan
	Medical claims data might have limited use to support regulatory-grade decision-making	Claims data have limited clinical outcome data.		Combine with EHRs to expand the applicability, coverage, and depth of data [77]
Follow-up
	Difficult to capture continuity of care in a single data source	Diagnosis is spread across multiple physicians; if the patient moves and seeks care outside of the care network, follow-up data will be lost.		Tokenization/data linkage and advanced analytics with EHR data for capturing a more complete patient journey (particularly helpful for rare conditions where the sample size would be low) [23,44,71]; analytical approaches (ie, imputation) for missing data [23,44,71,78]
Time selection
	Timing of therapy	Patient has multiple lines of treatment; what should be considered the index date?		Define a proper index date or “time zero” following the target trial emulation framework [52]
	Timing of data collection – inconsistent standard of care over time	Data may be present, but are not current enough to provide a reasonable comparison to the current standard of care.		De novo data collection [55]; tokenization/data linkage [78]
Geography
	External control arm nongeneralizable to clinical practice	The main external control arm data source is from outside of the country of interest. Two unlinked data sources with available data are selected to obtain a sufficient sample size, but it is unclear whether patients overlap across care networks.		Tokenization/data linkage, which improves patient counts with geographic representation while accounting for duplicates [79]; transportability [60]
Analysis phase
	Data loss or insufficient sample size to detect an effect	In the analysis phase, during matching, the power to detect an effect is reduced.		De novo data collection [55]; tokenization/data linkage [23,44]; analytical approaches (ie, imputation) for missing data [78]
	Avoiding the appearance of a post hoc or cherry-picked analysis	Data dredging/post hoc analysis (eg, regulators can assume the most appealing analysis was conducted).		Transparent prespecified description of data selection, data provenance, and the statistical analysis plan [3]
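
The tokenization/data linkage solution listed in the table for the follow-up and geography challenges can be illustrated with a minimal sketch. The sketch assumes that each data source holds the same identifying fields and that a keyed hash of the normalized fields serves as the patient token. The key, field names, and helper functions (`make_token`, `link_sources`) are hypothetical and are not taken from the cited studies; production tokenization typically relies on dedicated services and probabilistic matching rather than this exact-match approach.

```python
import hashlib
import hmac

# Hypothetical shared secret; real tokenization services manage keys securely.
SECRET_KEY = b"example-key-not-for-production"


def make_token(first_name: str, last_name: str, birth_date: str) -> str:
    """Derive a de-identified token from normalized identifiers (illustrative only)."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_sources(source_a: list, source_b: list) -> dict:
    """Group records from two sources by token so each patient is counted once."""
    linked = {}
    for source_name, records in (("A", source_a), ("B", source_b)):
        for record in records:
            token = make_token(
                record["first_name"], record["last_name"], record["birth_date"]
            )
            linked.setdefault(token, []).append(
                {"source": source_name, "visit": record["visit"]}
            )
    return linked


# Made-up records from two unlinked care networks.
ehr_network_a = [
    {"first_name": "Ana", "last_name": "Lopez", "birth_date": "1980-02-01", "visit": "2021-03-10"},
    {"first_name": "Ben", "last_name": "Okafor", "birth_date": "1975-07-19", "visit": "2021-05-02"},
]
ehr_network_b = [
    {"first_name": " ana ", "last_name": "LOPEZ", "birth_date": "1980-02-01", "visit": "2022-01-15"},
    {"first_name": "Chen", "last_name": "Wei", "birth_date": "1990-11-30", "visit": "2022-04-08"},
]

linked = link_sources(ehr_network_a, ehr_network_b)
unique_patients = len(linked)
# Tokens that appear in both sources identify overlapping patients.
overlapping = [
    token for token, recs in linked.items()
    if {rec["source"] for rec in recs} == {"A", "B"}
]
print(unique_patients, len(overlapping))
```

In this toy example, the four records collapse to three unique patients, one of whom appears in both networks, so the combined sample size is not inflated by duplicates and that patient's follow-up spans both sources.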