[Preprint]. 2025 May 25:2025.05.23.25328115. [Version 1] doi: 10.1101/2025.05.23.25328115

Table 2. Datasets used in this study.

| | Pilot set | Trial 1 test set | Trial 2 held-out test set |
|---|---|---|---|
| Description of patients and enrollment procedure | All 150 patients enrolled in a 2-hospital pilot trial with informed consent, Nov. 2018 to Feb. 2020⁹ | Sample of 160 patients enrolled in a 2,512-person, 3-hospital pragmatic trial under a waiver of informed consent, enriched for ADRD (80 of 160 [50%]), Apr. 2020 to Mar. 2021¹⁰,¹¹ | All 617 patients enrolled in a 3-hospital comparative-effectiveness trial with informed consent, Jul. 2021 to Nov. 2023¹¹ (trial results not yet published) |
| Description of EHR notes | 4,642 notes from index admission to discharge | 2,974 notes from index admission to 30 days post-randomization | 11,574 notes from randomization to 30 days post-randomization |
| Adjudication of ground truth | Manual whole-chart abstraction by human reviewers, with regular quality assurance applied at the passage level | Manual whole-chart abstraction by human reviewers, with regular quality assurance applied at the passage level | BERT NLP-screened human abstraction of passages scoring above the 98.5th percentile,ᵃ with principal-investigator co-review of all positive abstractions by patient |
| Prevalence of GOC discussions | 0.2% of BERT segments; 340/4,642 (7.3%) notes; 34/150 (23%) patients | 0.4% of BERT segments; 295/2,974 (9.9%) notes; 59/160 (37%) patients | 304/2,136 notes in case-control sample;ᵇ 163/617 (26%) patients |
| Role of dataset: Llama 3.3 LLM (zero-shot prompt)ᶜ | Development | Testing | Held-out testing |
| Role of dataset: BERT (supervised ML)ᶜ | Training and validation | Testing | Held-out testing |

Abbreviations: ADRD, Alzheimer disease and related dementias; EHR, electronic health record; BERT, Bidirectional Encoder Representations from Transformers; GOC, goals of care; NLP, natural language processing; LLM, large language model; ML, machine learning.

ᵃ In the Trial 1 test set, this screening threshold corresponded to 99.3% note-level sensitivity (95% CI, 97.6% to 99.9%) and 100% patient-level sensitivity (one-sided 97.5% CI, 97.7% to 100%).
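When every one of n positive cases is detected (observed sensitivity 100%), the exact (Clopper-Pearson) one-sided lower confidence bound has a simple closed form, α^(1/n). The sketch below illustrates that arithmetic; the function name and the denominators shown are illustrative, not taken from the study.

```python
import math

def sensitivity_lower_bound(n_positives: int, alpha: float = 0.025) -> float:
    """Exact (Clopper-Pearson) one-sided lower confidence bound for
    sensitivity when all n positive cases are detected (x = n).
    In this special case the bound reduces to alpha ** (1/n)."""
    return alpha ** (1.0 / n_positives)

# Hypothetical denominators: the bound tightens toward 1 as n grows.
print(round(sensitivity_lower_bound(100), 3))  # 0.964
print(round(sensitivity_lower_bound(160), 3))  # 0.977
```

Larger denominators give higher lower bounds, which is why a perfect observed sensitivity still carries meaningful uncertainty when n is small.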

ᵇ Case-control sample of notes in the Trial 2 held-out test set: cases, 304 notes with human-confirmed GOC content; controls, up to three randomly sampled GOC-negative notes per patient. The prevalence in this sample (14%) is not expected to represent the source data.

ᶜ In supervised machine learning, “training” refers to fitting a model to labeled data; “validation” refers to tuning model hyperparameters during development; and “testing” refers to evaluating the performance of the final model after all training and development are complete. A “held-out test set” is a test set kept completely separate from all model development, minimizing the possibility of indirect leakage or bias. Because zero-shot prompting does not fit a model to any labeled data, the pilot set is referred to as a “development” set for the Llama model.
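The roles above can be sketched as a conventional supervised split, with the caveat that a true held-out test set (as in Trial 2 here) comes from a separate source and is never passed through any of this. The helper below is a generic illustration, not code from the study.

```python
import random

def split_dataset(records, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle labeled records, then carve out validation and test
    partitions; the remainder is used for training. A held-out test
    set would come from a separate dataset entirely and must never
    be touched during model development."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

For the zero-shot Llama workflow no such split exists: the “development” set is simply where prompts are drafted and refined, since no parameters are fit to labels.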