2022 Jun 22;14(13):3063. doi: 10.3390/cancers14133063

Table 4.

Evaluation framework template for the illustrative example.

Variable: Metastatic Diagnosis (yes/no)
Model Description¹	Inputs to the model include unstructured documents from the EHR (e.g., visit notes, pathology/radiology reports). The output of the model is a binary prediction (yes/no) for whether the patient has a metastatic diagnosis at any time in the record.
Target Dataset/Population	The model is used in a dataset that contains patients with non-small cell lung cancer (NSCLC).
Common Analytic Use Case	(1) Selecting a cohort of patients who have (or do not have) metastatic disease; (2) using metastatic status as a covariate or stratifying variable in an analysis.
ML-Extracted Variable Evaluation
Components	Description	Hypothetical Results and Findings
Test Set	The size of the test set is selected to achieve a target margin of error for the primary evaluation metric (e.g., sensitivity or PPV) within the minority class (metastatic disease). To measure model performance, a random sample of patients is taken from an NSCLC cohort and withheld from model development.	Patients are selected from the target population and are not included in model development.
Overall Performance	As the primary use of this variable is to select a cohort of metastatic patients, sensitivity, PPV, specificity, and NPV are measured. To evaluate how well this variable selects a metastatic cohort, emphasis is placed on sensitivity and PPV to understand the proportion of patients missed and the proportion of patients incorrectly included in the final cohort.	Sensitivity² = 0.94; PPV³ = 0.91; Specificity⁴ = 0.90; NPV⁵ = 0.90
Stratified Performance	Sensitivity and PPV for both the metastatic and non-metastatic classes are calculated across strata of variables of interest. Stratifying variables are selected with the following goals in mind: (1) performance in sub-cohorts of interest (e.g., year of diagnosis); (2) fairness (e.g., race and ethnicity); (3) risk of statistical bias in analysis (e.g., cancer stage at diagnosis).	Example finding for race and ethnicity: Sensitivity for the “metastatic” class is 5% higher for the “Black or African American” race group vs. “White”. PPV for the “metastatic” class is 5% lower for the “Black or African American” race group vs. “White”.
Quantitative Error Analysis	To understand the impact of model errors on the selected study cohort, baseline characteristics and rwOS are evaluated for the following groups: (1) true positives vs. false negatives; (2) true positives vs. false positives. Typically, patients with non-metastatic disease have longer survival times than patients with metastatic disease. If model misclassification is random, the inclusion of false positives in the study cohort will result in longer observed survival times. However, if model misclassification is systematic and false positives have survival similar to patients with metastatic disease, then the distribution of survival times may remain relatively unchanged.	Example findings from rwOS analysis: rwOS* for false positives (21 months) was similar to that for true positives (17 months). Example findings from baseline characteristic analysis: Compared to true negatives, false positives are less likely to have a history of smoking (86% vs. 91%).
Replication of Use Cases	Evaluate rwOS from metastatic diagnosis date for patients selected as metastatic by the ML-extracted variable vs. its abstracted counterpart (outcomes in the general population).	rwOS for ML-extracted cohort: 9.8 months (95% CI 8.92–10.75); rwOS for abstracted cohort: 9.8 months (95% CI 8.92–10.69)

¹: The model is constructed using snippets of text around key terms related to “metastasis,” which are processed by a long short-term memory (LSTM) network to produce a compact vector representation of each sentence. These representations are then processed by additional network layers to produce a final metastatic status prediction [31].
²: Sensitivity refers to the proportion of patients abstracted as having a value of a variable (e.g., metastasis = true) that are also ML-extracted as having the same value.
³: PPV refers to the proportion of patients ML-extracted as having a value of a variable (e.g., metastasis = true) that are also abstracted as having the same value.
⁴: Specificity refers to the proportion of patients abstracted as not having a value of a variable (e.g., metastasis = false) that are also ML-extracted as not having the same value.
⁵: NPV refers to the proportion of patients ML-extracted as not having a value of a variable (e.g., metastasis = false) that are also abstracted as not having the same value.
*: rwOS analysis was performed using the Kaplan–Meier method [32].
**: The index date selected for rwOS calculation can be changed based on the study goals. However, the index date that is selected should be available for all patients, regardless of the concordance of their abstracted and predicted values. In this illustrative example, we provide the rwOS strictly as an example and do not specify the index date, as index date selection will be case-dependent.
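
The four metric definitions in footnotes 2 to 5 can be illustrated with a minimal Python sketch that treats the abstracted value as the reference label and the ML-extracted value as the prediction. The function names (`confusion_counts`, `evaluate_binary_variable`) and the toy labels are hypothetical and chosen only for illustration; they are not taken from the study and do not reproduce the hypothetical results in the table.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(
    abstracted: Sequence[bool], ml_extracted: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), using the abstracted value as the reference."""
    if len(abstracted) != len(ml_extracted):
        raise ValueError("abstracted and ml_extracted must have equal length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(abstracted, ml_extracted):
        if truth and pred:
            tp += 1  # abstracted metastatic, ML-extracted metastatic
        elif truth and not pred:
            fn += 1  # abstracted metastatic, missed by the model
        elif pred:
            fp += 1  # abstracted non-metastatic, ML-extracted metastatic
        else:
            tn += 1  # both non-metastatic
    return tp, fp, fn, tn


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate_binary_variable(
    abstracted: Sequence[bool], ml_extracted: Sequence[bool]
) -> Dict[str, float]:
    """Compute sensitivity, PPV, specificity, and NPV per footnotes 2 to 5."""
    tp, fp, fn, tn = confusion_counts(abstracted, ml_extracted)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # among abstracted positives
        "ppv": _ratio(tp, tp + fp),          # among ML-extracted positives
        "specificity": _ratio(tn, tn + fp),  # among abstracted negatives
        "npv": _ratio(tn, tn + fn),          # among ML-extracted negatives
    }


# Toy labels (invented for illustration): 10 abstracted-metastatic patients
# followed by 10 abstracted-non-metastatic patients.
abstracted = [True] * 10 + [False] * 10
ml_extracted = [True] * 8 + [False] * 2 + [True] * 1 + [False] * 9
metrics = evaluate_binary_variable(abstracted, ml_extracted)
```

Applying the same function to the subset of patients within each stratum (e.g., each race and ethnicity group) would give the stratified sensitivity and PPV described in the Stratified Performance row.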