Table 1.
EHR Source Information Type | Variables Needed for Analysis | Curation Approaches |
---|---|---|
Structured data (e.g., date of birth) | | Transformation, harmonization, and deduplication |
Unstructured data (e.g., clinic notes, PDF lab reports, radiology images) | | Expert abstraction or ML extraction |
Abbreviations: ALK: anaplastic lymphoma kinase; EGFR: epidermal growth factor receptor; ECOG: Eastern Cooperative Oncology Group; ML: machine learning; NSCLC: non-small cell lung cancer; PD-L1: programmed death-ligand 1.

a Mortality date is a composite variable based on multiple data sources (structured and unstructured EHR data, commercial sources, and the Social Security Death Index) [16]. ML extraction was not used to define this variable.

b Line of therapy and line of therapy date are derived variables based on both structured and unstructured data inputs.