Table 4. Performance of the LLM and Text String Search Compared With Criterion Standard (n = 400)ᵃ

| | Precision: wearing helmet | Precision: not wearing helmet | Precision: unknown | Recall: wearing helmet | Recall: not wearing helmet | Recall: unknown | F1 Score: wearing helmet | F1 Score: not wearing helmet | F1 Score: unknown | Cohen κ (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLM | 0.98 | 1.00 | 0.98 | 0.98 | 0.96 | 1.00 | 0.98 | 0.98 | 1.00 | 0.98 (0.96-1.00) |
| Text string search | 0.98 | 1.00 | 0.98 | 0.98 | 0.96 | 1.00 | 0.98 | 0.98 | 1.00 | 0.98 (0.96-1.00) |
Abbreviation: LLM, large language model.
ᵃ These 400 records were a random draw from the high-detail prompt (December 7, 2023).
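As a minimal sketch of how these metrics can be reproduced, the snippet below computes per-category precision, recall, F1 score, and Cohen κ with scikit-learn. The label arrays are hypothetical toy data standing in for the 400 manually coded records and the classifier outputs (LLM or text string search); they are not the study data.

```python
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

# Hypothetical toy labels: criterion-standard (manually coded) categories and
# the classifier's output for the same records. Categories match the table.
labels = ["wearing helmet", "not wearing helmet", "unknown"]
criterion = ["wearing helmet", "not wearing helmet", "unknown", "wearing helmet",
             "not wearing helmet", "unknown", "wearing helmet", "not wearing helmet"]
predicted = ["wearing helmet", "not wearing helmet", "unknown", "wearing helmet",
             "wearing helmet", "unknown", "wearing helmet", "not wearing helmet"]

# Per-category precision, recall, and F1 score, reported in the order of `labels`.
precision, recall, f1, _ = precision_recall_fscore_support(
    criterion, predicted, labels=labels, zero_division=0
)
for category, p, r, f in zip(labels, precision, recall, f1):
    print(f"{category}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")

# Cohen kappa summarizes agreement with the criterion standard beyond chance.
print(f"Cohen kappa = {cohen_kappa_score(criterion, predicted):.2f}")
```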