2023 Mar 20;15(6):1853. doi: 10.3390/cancers15061853

Table 1.

Study variables and EHR data source.

EHR source information type: Structured data (e.g., date of birth)
Variables needed for analysis:
  1. Diagnoses (i.e., ICD codes)
  2. Gender
  3. Birth year
  4. Race
  5. Ethnicity
  6. Practice type
  7. ECOG performance status
  8. Medication order date
  9. Medication administration date
  10. Visit date
  11. Mortality date (a)
Curation approaches: Transformation, harmonization, and deduplication

EHR source information type: Unstructured data (e.g., clinic notes, PDF lab reports, radiology images, etc.)
Variables needed for analysis:
  1. NSCLC diagnosis
  2. NSCLC diagnosis date
  3. Advanced NSCLC diagnosis
  4. Advanced NSCLC diagnosis date
  5. ROS1 test result
  6. ROS1 test date
  7. ALK test result
  8. BRAF test result
  9. EGFR test result
  10. ALK test date
  11. BRAF test date
  12. EGFR test date
  13. PD-L1 percent staining
  14. PD-L1 test result date
  15. Group stage
  16. Histology
  17. Line of therapy (b)
  18. Line of therapy start date (b)
Curation approaches: Expert abstraction OR ML-extraction

Abbreviations: ALK: anaplastic lymphoma kinase; ECOG: Eastern Cooperative Oncology Group; EGFR: epidermal growth factor receptor; ML: machine learning; NSCLC: non-small cell lung cancer; PD-L1: programmed death-ligand 1.
(a) Mortality date is a composite variable based on multiple data sources (structured and unstructured EHR data, commercial sources, and the Social Security Death Index) [16]. ML extraction was not used to define this variable.
(b) Line of therapy and line of therapy start date are derived variables based on both structured and unstructured data inputs.
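
For readers who want to work with Table 1 programmatically, the mapping from each study variable to its EHR source type and curation approach can be held in a small lookup structure. The Python sketch below is illustrative only and is not part of the study's pipeline: the names `StudyVariable`, `CATALOG`, and `curation_for`, as well as the lowercase string labels, are hypothetical choices made for this example. The variable names and footnote content are taken from the table itself.

```python
from dataclasses import dataclass

# Labels below are hypothetical; the groupings follow Table 1.
STRUCTURED_CURATION = "transformation, harmonization, and deduplication"
UNSTRUCTURED_CURATION = "expert abstraction or ML-extraction"

STRUCTURED_VARIABLES = [
    "Diagnoses (i.e., ICD codes)", "Gender", "Birth year", "Race",
    "Ethnicity", "Practice type", "ECOG performance status",
    "Medication order date", "Medication administration date",
    "Visit date", "Mortality date",
]

UNSTRUCTURED_VARIABLES = [
    "NSCLC diagnosis", "NSCLC diagnosis date",
    "Advanced NSCLC diagnosis", "Advanced NSCLC diagnosis date",
    "ROS1 test result", "ROS1 test date",
    "ALK test result", "BRAF test result", "EGFR test result",
    "ALK test date", "BRAF test date", "EGFR test date",
    "PD-L1 percent staining", "PD-L1 test result date",
    "Group stage", "Histology",
    "Line of therapy", "Line of therapy start date",
]

# Footnotes (a) and (b) of Table 1, paraphrased as short notes.
NOTES = {
    "Mortality date": "composite of multiple data sources; ML extraction not used",
    "Line of therapy": "derived from structured and unstructured inputs",
    "Line of therapy start date": "derived from structured and unstructured inputs",
}


@dataclass(frozen=True)
class StudyVariable:
    name: str
    source: str      # "structured" or "unstructured", as listed in Table 1
    curation: str    # curation approach listed for that source type
    note: str = ""   # footnote text, if the table attaches one


def build_catalog():
    """Return a dict mapping each variable name to its StudyVariable record."""
    catalog = {}
    for name in STRUCTURED_VARIABLES:
        catalog[name] = StudyVariable(
            name, "structured", STRUCTURED_CURATION, NOTES.get(name, ""))
    for name in UNSTRUCTURED_VARIABLES:
        catalog[name] = StudyVariable(
            name, "unstructured", UNSTRUCTURED_CURATION, NOTES.get(name, ""))
    return catalog


CATALOG = build_catalog()


def curation_for(name):
    """Look up the curation approach Table 1 lists for a variable."""
    return CATALOG[name].curation
```

A call such as `curation_for("ROS1 test result")` returns the curation approach listed for unstructured data, and the `note` field carries the footnote caveats, for example that mortality date is a composite variable even though it appears in the structured-data row.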