Alzheimer's & Dementia. 2025 Jan 9;20(Suppl 4):e087416. doi: 10.1002/alz.087416

Evaluating Large Language Models (LLMs) in Information Extraction: A Case Study of Extracting Cognitive Exam Dates and Scores

Hao Zhang 1, Neil Jethani 1, Simon Jones 1, Nicholas Genes 1, Vincent J Major 1, Ian S Jaffe 1, Anthony B Cardillo 1, Noah Heilenbach 1, Nadia Fazal Ali 1, Luke J Bonanni 1, Andrew Clayburn 1, Zain Khera 1, Erica C Sadler 1, Jaideep Prasad 1, Jamie Schlacter 1, Kevin Liu 1, Benjamin Silva 1, Sophie Montgomery 1, Eric J Kim 1, Jacob Lester 1, Theodore M Hill 1, Alba Avoricani 1, Ethan Chervonski 1, James Davydov 1, William Small 1, Eesha Chakravartty 1, Himanshu Grover 1, John Dodson 1, Abraham A Brody 1, Yindalon Aphinyanaphongs 2, Arjun V Masurkar 1, Narges Razavian 2
PMCID: PMC11713360

Abstract

Background

Large language models (LLMs) provide powerful natural language processing capabilities for medical and clinical tasks. Evaluating LLM performance is crucial because these models can produce incorrect or fabricated results. In this study, we assessed ChatGPT and Llama2, two state‐of‐the‐art LLMs, on extracting information from clinical notes, focusing on cognitive tests, specifically the Mini‐Mental State Examination (MMSE) and Clinical Dementia Rating (CDR).

Method

We compiled a dataset of 765 clinical notes mentioning the MMSE and CDR. Twenty‐two medically trained experts provided the ground truth. ChatGPT (GPT‐4, version “2023‐03‐15‐preview”) and Llama2 (“Llama‐2‐70b‐chat”) were used to extract MMSE and CDR instances with corresponding dates. Inference was successful for 742 notes. We used 20 notes for fine‐tuning and for training the reviewers; the remaining 722 were assigned to reviewers, with 309 assigned to two reviewers simultaneously. Precision, sensitivity, true/false‐negative rates, and accuracy were calculated. For double‐reviewed notes, we qualitatively assessed the errors.
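For readers unfamiliar with this kind of pipeline, the sketch below illustrates how a note might be passed to GPT‐4 through the Azure OpenAI API version named above and asked to return scores and dates as JSON. The deployment name, endpoint, prompt wording, and output schema are illustrative assumptions, not the authors' actual prompt or code.

```python
# Minimal illustrative sketch: prompting GPT-4 (Azure OpenAI, API version
# "2023-03-15-preview") to extract MMSE/CDR scores and dates from one note.
# Deployment name, endpoint, prompt text, and JSON schema are assumptions.
import json
import openai

openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"  # placeholder endpoint
openai.api_version = "2023-03-15-preview"
openai.api_key = "<your-key>"

PROMPT = (
    "From the clinical note below, list every MMSE and CDR score together with "
    "the date it was recorded. Respond with a JSON array of objects with keys "
    "'test' (MMSE or CDR), 'score', and 'date' (YYYY-MM-DD or null if no date). "
    "If no scores are present, return an empty array.\n\nNote:\n"
)

def extract_scores(note_text: str) -> list[dict]:
    """Return the model's extracted (test, score, date) triples for one note."""
    response = openai.ChatCompletion.create(
        engine="gpt-4",  # Azure deployment name (assumption)
        messages=[{"role": "user", "content": PROMPT + note_text}],
        temperature=0,
    )
    return json.loads(response["choices"][0]["message"]["content"])
```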

Result

Patient and note characteristics are shown in Table 1. For MMSE information extraction, ChatGPT (vs. Llama2) achieved an accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true‐negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR extraction, performance was weaker overall, particularly precision: accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true‐negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and Llama2 on the double‐reviewed notes. Llama2's errors included 27 cases of total hallucination, 19 cases where another score was reported instead of the MMSE, 25 missed scores, and 23 cases where the wrong date was reported for the right score. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases where another test was reported instead of the MMSE, and 19 cases of reporting a wrong date.
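As a reminder of how the reported figures relate to one another, the sketch below computes the four metrics from per‐note confusion‐matrix counts (extracted scores judged against the expert ground truth). The counting procedure shown is a standard assumption; the abstract does not specify the authors' exact tabulation.

```python
# Sketch of the evaluation metrics reported above, from confusion-matrix counts:
#   tp = correctly extracted scores, fp = spurious/incorrect extractions,
#   tn = notes with no score correctly left empty, fn = missed scores.
def extraction_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    precision = tp / (tp + fp)            # fraction of extracted values that are correct
    sensitivity = tp / (tp + fn)          # fraction of true values that were extracted
    true_negative_rate = tn / (tn + fp)   # absent scores correctly reported as absent
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "true_negative_rate": true_negative_rate,
        "accuracy": accuracy,
    }
```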

Conclusion

ChatGPT exhibited high accuracy in extracting MMSE scores and dates, outperforming Llama2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.



