2023 May 25;25:e45662. doi: 10.2196/45662

Table 2.

Guidelines for prespecification and reporting of data creation.


Item number Recommendation
Data harmonization

Target RCT^a study design 1
  • Prespecification: Indicate the source of the study design document (protocol, reporting paper, or others).

  • Reporting: Describe the sections and tables from which the relevant variables are recognized.


RCT variable list 2
  • Prespecification: Specify the method for recognizing variables from the RCT document and matching them to relevant EHR^b features.

  • Reporting: Define all end points, interventions, eligibility criteria, and other baseline characteristics recognized from the RCT document along with the matched EHR features.

Cohort construction

Data mart 3
  • Prespecification: Specify the method for compiling the broad list of EHR features indicating the condition or disease of interest.

  • Reporting: List the EHR features, and state the algorithm to define inclusion in the data mart.

  • Reporting: Report the size of the data mart.


Disease cohort 4
  • Prespecification: Specify the method for ascertaining the phenotype of the condition or disease of interest.

  • Reporting: Describe the input EHR features of the phenotyping algorithm.

  • Reporting: State the phenotyping algorithm with chosen parameters.

  • Reporting: Report the AUC^c of prediction and the accuracy of the disease cohort.


Treatment arms 5
  • Reporting: Explain how treatment initiation time is determined.

  • Reporting: Explain how treatment arms are defined with the list of involved EHR features, time windows, and the algorithm.

Variable curation

End points 6
  • Prespecification: Specify the method for ascertaining the end points.

  • Reporting: State the end point algorithm with chosen parameters.

  • Reporting: Explain how the end point is defined.


Baseline characteristics (eligibility criteria and confounders) 7
  • Prespecification: Specify the variable curation plans for each class of baseline characteristics.

  • Reporting: List the baseline characteristics considered in the RWE^d and define how they are created with input EHR features, time windows, groupings, and other transformations.

  • Reporting: Explain how eligibility criteria will be matched according to the curated baseline characteristics.

  • Reporting: Present the summary statistics of the baseline characteristics in treatment arms filtered by eligibility criteria.


Additional confounders 8
  • Prespecification: List the other confounders considered in the RWE and define how they are created with input EHR features, time windows, groupings, and other transformations.


Missing data 9
  • Reporting: Describe how missing information on variables is handled.

Validation

Sampling strategy 10
  • Prespecification: Specify the sampling strategy for the validation set.

  • Reporting: Report the sizes and the list of variables (in data mart, disease cohort, and arms) of the validation set.


Data accuracy 11
  • Reporting: Report the agreement between gold-standard data from validation and curated data.

  • Reporting: Explain how inaccurate data are dealt with.

Publication 12
  • Reporting: Export the final curation models for the condition or disease of interest, end points, and other variables curated through machine learning methods.

^a RCT: randomized controlled trial.

^b EHR: electronic health record.

^c AUC: area under the curve.

^d RWE: real-world evidence.
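
Items 4 and 11 ask for an AUC for the phenotyping algorithm and for the agreement between gold-standard and curated data. The sketch below shows one way such figures could be computed from a validation set. It is only an illustration: the function names and toy data are hypothetical, and the use of percent agreement and Cohen's kappa is an assumption, since the guideline does not prescribe a particular agreement statistic.

```python
from collections import Counter


def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative case (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percent_agreement(gold, curated):
    """Fraction of records where the curated value equals the gold standard."""
    if len(gold) != len(curated) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(g == c for g, c in zip(gold, curated)) / len(gold)


def cohen_kappa(gold, curated):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    observed = percent_agreement(gold, curated)
    gold_counts, curated_counts = Counter(gold), Counter(curated)
    expected = sum(gold_counts[k] * curated_counts[k] for k in gold_counts) / n**2
    if expected == 1.0:
        # Both sources carry a single identical category; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical validation set: chart-review labels (gold standard),
# phenotyping scores, and the curated binary labels derived from them.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.7, 0.6]
curated = [1, 1, 0, 0, 0, 0, 1, 1]

print("AUC:", auc(gold, scores))
print("Agreement:", percent_agreement(gold, curated))
print("Kappa:", cohen_kappa(gold, curated))
```

Reporting a chance-corrected statistic alongside raw agreement can be useful when the condition of interest is rare, because raw agreement alone may look high even for an uninformative curation algorithm.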