Table 2.
Datasets used in this study.
| | Pilot set | Trial 1 test set | Trial 2 held-out test set |
|---|---|---|---|
| Description of patients and enrollment procedure | All 150 patients enrolled in a 2-hospital pilot trial with informed consent, Nov. 2018 to Feb. 2020<sup>9</sup> | Sample of 160 patients enrolled in a 2,512-person 3-hospital pragmatic trial under a waiver of informed consent, enriched for ADRD (80 of 160 [50%]), Apr. 2020 to Mar. 2021<sup>10,11</sup> | All 617 patients enrolled in a 3-hospital comparative-effectiveness trial with informed consent, Jul. 2021 to Nov. 2023<sup>11</sup> (trial results not yet published) |
| Description of EHR notes | 4,642 notes from index admission to discharge | 2,974 notes from index admission to 30 days post-randomization | 11,574 notes from randomization to 30 days post-randomization |
| Adjudication of ground truth | Manual whole-chart abstraction by human reviewers, with regular quality assurance applied at the passage level | Manual whole-chart abstraction by human reviewers, with regular quality assurance applied at the passage level | BERT NLP-screened human abstraction of passages scoring above the 98.5th percentile,<sup>a</sup> with principal investigator co-review of all positive abstractions by patient |
| Prevalence of GOC discussions | 0.2% of BERT segments; 340 / 4,642 (7.3%) notes; 34 / 150 (23%) patients | 0.4% of BERT segments; 295 / 2,974 (9.9%) notes; 59 / 160 (37%) patients | 304 / 2,136 (14%) notes in case-control sample;<sup>b</sup> 163 / 617 (26%) patients |
| Role of dataset for each NLP model<sup>c</sup> | | | |
| Llama 3.3 LLM (zero-shot prompt) | Development | Testing | Held-out testing |
| BERT (supervised ML) | Training and validation | Testing | Held-out testing |
Abbreviations: ADRD, Alzheimer disease and related dementias; EHR, electronic health record; BERT, Bidirectional Encoder Representations from Transformers; GOC, goals of care; NLP, natural language processing; LLM, large language model; ML, machine learning.
<sup>a</sup> In the Trial 1 test set, this screening threshold corresponded to 99.3% note-level sensitivity (95% CI 97.6%, 99.9%) and 100% patient-level sensitivity (one-sided 97.5% CI 97.7%, 100%).
<sup>b</sup> Case-control sample of notes in the Trial 2 held-out test set: cases, 304 notes with human-confirmed GOC content; controls, up to three randomly sampled GOC-negative notes per patient. The prevalence in this sample (14%) is not expected to represent the source data.
<sup>c</sup> In supervised machine learning, “training” refers to fitting a model to labeled data; “validation” refers to tuning model hyperparameters during development; and “testing” refers to evaluating the performance of the final model after all training and development are complete. A “held-out test set” is a test set kept completely separate from all model development, minimizing the possibility of indirect leakage or bias. Because zero-shot prompting does not fit a model to labeled data, the pilot set is referred to as a “development” set for the Llama model.
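The one-sided 97.5% confidence interval quoted in footnote a for 100% sensitivity has a simple closed form: when all n positives are detected, the exact (Clopper-Pearson) lower bound is the value p solving p^n = α. A minimal sketch, using n = 160 purely for illustration (the study's own denominator is not restated here, though this n reproduces the quoted 97.7% lower bound):

```python
import math


def exact_lower_bound_all_successes(n, alpha=0.025):
    """Clopper-Pearson one-sided lower confidence bound on a proportion
    when all n trials are successes: the bound p solves p**n = alpha."""
    return alpha ** (1.0 / n)


# Illustrative: with n = 160 and a one-sided 97.5% interval,
# the exact lower bound rounds to 0.977 (i.e., 97.7%).
print(round(exact_lower_bound_all_successes(160), 3))  # 0.977
```

This is why a sensitivity of 100% still carries a nontrivial lower confidence bound: the bound is driven entirely by the sample size n.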
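In this study the training, validation, and held-out test roles are filled by entirely separate trial datasets, but the same distinction is conventionally realized by randomly partitioning one labeled dataset. A minimal sketch of that conventional three-way split (the function name, seed, and fractions are illustrative, not from the study):

```python
import random


def three_way_split(records, seed=0, frac_train=0.7, frac_val=0.15):
    """Illustrative random partition into training, validation, and
    held-out test sets. The held-out test portion is set aside before
    any model development and never used for fitting or tuning."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder is the held-out test set
    return train, val, test


train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Keeping the held-out portion untouched until final evaluation is what footnote c means by minimizing indirect leakage: no hyperparameter choice can be influenced, even accidentally, by test-set performance.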