2024 Jan 19;8:e2300046. doi: 10.1200/CCI.23.00046

TABLE 2.

Data Reliability Subdimensions in Flatiron Health RWD

Data Type	Structured Data	Unstructured Data	Unstructured Data
Processing Method	Harmonization	Human Abstraction	ML/NLP Extraction
Accuracy	Data collected in structured format for primary purposes are harmonized to reference terminologies for secondary research use. Mapping processes are manually reviewed by the medical informatics team to ensure conformance to standards (external or Flatiron Health–established). Mapping guidelines are updated as new types of EHR data become available or to reflect changes to secondary uses of EHR data	Data are validated using one or more of the following approaches: (1) validation against an external reference standard; (2) indirect benchmarking (data distributions, outcomes, etc) against literature and/or oncology clinical expert guidance; (3) validation against an internal reference standard; and (4) verification checks as proxies for accuracy	Data are validated against an internal reference standard, typically human-abstracted data. Against this reference standard, metrics (eg, sensitivity, specificity, PPV, and NPV) are assessed
Completeness	Data completeness is reflective of data availability within the EHR and is maximized by ensuring timeliness of data capture and integrity of data pipelines. Sites with low completeness during integration are excluded until they meet target thresholds. Quality control checks detect any large drops in data that would signal issues with integrations or ETL pipelines	Data abstraction forms are built with logic checks to ensure data are input when needed. Data completeness distributions are assessed according to expectations for data availability within the EHR; if expectations are not met, then further investigation is conducted to find and correct the root cause	Data completeness is expected to reflect data availability within the EHR and is assessed by determining sensitivity of data capture against the reference data upon which the ML algorithm is trained and validated
Provenance	Data are traceable to the source. Harmonization rules dictating data mapping are maintained, updated as source data changes, and available as needed. Reference terminologies to which source data are mapped are updated, versioned, and maintained	Individual patient data points are traceable via a proprietary technology platform^a with an audit trail of abstracted data inputs, changes, and source documentation from the EHR reviewed by trained clinical abstractors. Policies and procedures and data abstraction forms are version-controlled	Data are traceable to source documentation via audit trails for NLP-acquired text. The ML algorithm is archived, and algorithm updates are logged
Timeliness	Mapped data are refreshed on a 24-hour cadence. Data pipelines are continually monitored, and sites with stale structured data are excluded	New EHR documentation considered relevant for a given variable is identified and surfaced for abstraction with a set recency (typically 30 days) to facilitate incremental updates. Abstraction resulting from new EHR documentation is reviewed and completed before data cutoff. Document ingestion is monitored	Information that is documented within the EHR by the time of data cutoff, whether it exists in structured or unstructured formats, is ingested and processed such that it is available for ML extraction

Abbreviations: EHR, electronic health records; ETL, extract, transform, and load; ML, machine learning; NLP, natural language processing; NPV, negative predictive value; PPV, positive predictive value; RWD, real-world data.

^a Shklarski et al.³³
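
The Accuracy row for ML/NLP extraction names four metrics (sensitivity, specificity, PPV, and NPV) assessed against an internal reference standard. As an illustration only, the following minimal Python sketch shows how such metrics could be computed for a single binary variable, assuming paired per-patient labels where the reference is human-abstracted data. The names `ValidationMetrics` and `validate_against_reference` are hypothetical and are not taken from the source; this is not a description of Flatiron Health's actual validation tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ValidationMetrics:
    """Binary classification metrics; None when a denominator is zero."""
    sensitivity: Optional[float]
    specificity: Optional[float]
    ppv: Optional[float]
    npv: Optional[float]


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # A metric is undefined when its denominator is empty.
    return numerator / denominator if denominator else None


def validate_against_reference(
    extracted: Iterable[bool], reference: Iterable[bool]
) -> ValidationMetrics:
    """Compare ML/NLP-extracted labels with reference labels, pairwise.

    Hypothetical helper: `extracted` holds the model output and
    `reference` holds the (assumed human-abstracted) reference standard,
    one entry per patient, in the same order.
    """
    extracted = list(extracted)
    reference = list(reference)
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must have the same length")

    tp = fp = tn = fn = 0
    for predicted, truth in zip(extracted, reference):
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    return ValidationMetrics(
        sensitivity=_ratio(tp, tp + fn),  # true positives among reference positives
        specificity=_ratio(tn, tn + fp),  # true negatives among reference negatives
        ppv=_ratio(tp, tp + fp),          # true positives among extracted positives
        npv=_ratio(tn, tn + fn),          # true negatives among extracted negatives
    )
```

Returning `None` for an undefined metric, rather than 0, keeps an empty denominator (for example, no reference-positive patients in the validation sample) distinguishable from genuinely poor performance.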