Abstract
Objectives
We propose and test a framework to detect disease diagnoses in French-language electronic health record (EHR) documents using a recent large language model (LLM), Meta's Llama-3-8B. Specifically, it focuses on detecting gout ('goutte' in French), a ubiquitous French word with multiple meanings beyond the disease. The study compares the performance of the LLM-based framework with traditional natural language processing techniques and tests its dependence on the parameters used.
Methods
The framework was developed using a training and testing set of 700 paragraphs mentioning 'gout', drawn from a random selection of EHR documents from a tertiary university hospital in Geneva, Switzerland. All paragraphs were manually reviewed and classified by two healthcare professionals into disease (true gout) and non-disease (gold standard). The LLM's accuracy was tested using few-shot and chain-of-thought prompting and compared with a regular expression (regex)-based method, focusing on the effects of model parameters and prompt structure. The framework was further validated on 600 paragraphs mentioning terms related to calcium pyrophosphate deposition disease (CPPD).
Results
The LLM-based algorithm outperformed the regex method, achieving a 92.7% (88.7%–95.4%) positive predictive value, a 96.6% (94.6%–97.8%) negative predictive value and an accuracy of 95.4% (93.6%–96.7%) for gout. In the validation set on CPPD, accuracy was 94.1% (90.2%–97.6%). The LLM framework performed well over a wide range of parameter values.
Conclusion
LLMs accurately detected disease diagnoses from EHRs, even in non-English languages. They could facilitate creating large disease registers in any language, improving disease care assessment and patient recruitment for clinical trials.
Keywords: Gout, Crystal arthropathies, Machine Learning, Chondrocalcinosis
What is already known on this topic
What this study adds
It proposes a framework based on Meta’s Llama-3-8B, a recent public LLM, which outperforms traditional natural language processing techniques in detecting gout and calcium pyrophosphate deposition disease in unstructured text.
It achieves high positive and negative predictive values and accuracy, with robust performance over a wide range of parameters.
How this study might affect research, practice or policy
The proposed framework can ease the use of LLMs in effectively detecting disease in EHR data for various clinical applications, such as creating large disease registers in any language, improving disease care assessment and patient recruitment for clinical trials.
Introduction
Since the release of their first version in 2018,1 large language models (LLMs) have improved very quickly, with frequent applications in research.2 They are effective at many complex tasks in the medical field, such as passing advanced exams,3 answering patients' questions4 or interpreting and extracting clinical concepts and data.5 6 They could thus be of high interest for interpreting, categorising or extracting information from electronic health records (EHRs), especially from the unstructured text (ie, free text) of hospital documents, as recently shown for the classification of injuries.7
In a previous work, to build a self-updating gout register from hospital EHR data,8 we used regular expressions (regex) and natural language processing techniques to identify patients with gout from any hospital document. The task proved rather complex: in addition to situations in which the diagnosis is negated or attributed to a family member, the word gout ('goutte' in French) also denotes drops and droplets, common words used to designate quantities of drugs or body fluids. It took months of back-and-forth to specify a query algorithm able to avoid unexpected uses of the word 'gout'. LLMs, on the other hand, because of their advanced natural language understanding abilities,9 could be able to correctly identify the diagnosis with little additional work.
The aim of this article is to propose a general framework for using LLMs to detect specific disease diagnoses, with the goal of creating automatic EHR registers or facilitating patient recruitment for clinical trials. We test the ability of advanced LLMs, such as those of the Llama family,10 using few-shot prompting and chain-of-thought techniques,11 to identify patients with a gout diagnosis from unstructured EHR data of a tertiary university hospital. The performance of the procedure is assessed by comparing the predictions to a gold standard and against the performance of a regex-based algorithm. We perform an extended sensitivity analysis to assess the robustness of our results against changes in LLM parameters and prompt structure.12 We further validate the proposed LLM framework on the detection of another form of crystal arthropathy, calcium pyrophosphate deposition disease (CPPD), from EHR documents. Ease of implementation and the computing resources needed are also considered.
Methods
Objective
We aim to propose and test an LLM-based framework to detect a disease diagnosis in unstructured EHR documents. To evaluate the framework's effectiveness, we used a disease that is challenging to detect in French, gout ('goutte'), as the word can also mean drops or droplets and appear in proverbs or even surnames. First, the LLM and a regex-based algorithm were used to determine whether a sentence containing the word gout indicates a positive diagnosis or something else (negative diagnosis, diagnosis of a family member, other use of the word gout). In a second step, the LLM framework was validated on a second dataset to determine whether sentences containing words referring to calcium pyrophosphate deposition disease (ie, chondrocalcinosis, pseudogout) indicate a positive diagnosis of CPPD or not. CPPD is a frequent differential diagnosis of gout.
Setting
The data stemmed from electronic documents of the Geneva University Hospital (HUG), a 2000-bed tertiary hospital and Switzerland's largest, serving a population of 517 802 residents as well as neighbouring French nationals working in Switzerland. A random sample of 1000 documents containing the word gout and 500 containing the CPPD-related words 'chondrocalcinose' (chondrocalcinosis), 'pyrophosphate' or 'pseudogoutte' (pseudogout) was queried from the data lake of the HUG, a MongoDB mirrored and centralised database version of the EHR data. From these documents, the surroundings of the detected words (gout, and CPPD-related terms for CPPD) were extracted, considering a boundary of 30 words before and after the detected word, or the sentence punctuation, whichever was closest. These short paragraphs were then manually anonymised before being analysed by the two algorithms. The resulting sentences were then split into three datasets:
A training set, concerned with gout, used to tune our algorithms.
A testing set, concerned with gout, used to estimate their performance.
A validation set, concerned with CPPD, used to validate the proposed LLM framework.
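The context-window extraction described above can be sketched as follows. This is a minimal illustration only: the keyword pattern and the example sentence are hypothetical, and the study's actual extraction code lives in its repository.

```python
import re

def extract_context(text: str, pattern: str, window: int = 30) -> list[str]:
    """Extract a short paragraph around each keyword match: up to `window`
    words before and after, truncated at sentence punctuation, whichever
    comes first (mirroring the extraction described above)."""
    paragraphs = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Keep only the text after the last sentence boundary before the match...
        before = re.split(r"[.!?]", text[:m.start()])[-1]
        # ...and before the first sentence boundary after the match.
        after = re.split(r"[.!?]", text[m.end():])[0]
        words = before.split()[-window:] + [m.group(0)] + after.split()[:window]
        paragraphs.append(" ".join(words))
    return paragraphs

doc = "Le patient souffre de goutte. Traitement: 10 gouttes de morphine."
print(extract_context(doc, r"\bgouttes?\b"))
```

Each match thus yields one short paragraph, which is what the manual reviewers and both algorithms then classify.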
Gold standard
The selected phrases were evaluated by two healthcare professionals (one internal medicine physician, one registered nurse, both trained by a board-certified rheumatologist) to assess whether they described the patient as having the disease, gout for the training and testing datasets and CPPD for the validation dataset, as opposed to a negation of the diagnosis, the description of a family member's diagnosis (eg, the father had gout) or an alternate use of the word (for drugs, body fluids or other uses). Disagreements were resolved by a board-certified rheumatologist.
Regex algorithm
The regex algorithm to detect a gout diagnosis consisted of the following steps:
Normalisation of the text (removing special characters, accents, duplicated space, setting to lower case).
Extracting the context of each gout word, the context being considered as eight words before and five after the word gout.
Performing the following tests for each resulting context:
Presence of a human body liquid, as an indication that gout refers to the droplet.
Presence of a French expression using the word gout.
Presence of the words 'family', 'father', 'mother', 'sister', 'brother', as an indication that the diagnosis concerns a family member.
Any use of gout in a diagnosis designating another disease (pseudogout, guttate psoriasis, thick smear for malaria).
Presence of a negative word and no double negation, as an indication of a negative diagnosis.
Presence of a drug that can be administered in droplets (based on the list of all drugs authorised in Switzerland).
Combining the tests. The patient is considered as having gout if none of the tests in step 3 detect any word in the context.
Details and code can be found at: https://gitlab.unige.ch/goutte/register_validation
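A minimal sketch of these steps is shown below. The word lists are illustrative stand-ins, not the study's actual lists (which include, for instance, all drugs authorised in Switzerland), and the double-negation handling is omitted:

```python
import re
import unicodedata

# Illustrative word lists only; the real lists are in the study's repository.
BODY_FLUIDS = {"sang", "urine", "sueur"}
FAMILY = {"famille", "pere", "mere", "soeur", "frere"}
OTHER_DISEASES = {"pseudogoutte", "epaisse"}  # eg, 'goutte epaisse' (thick smear)
NEGATIONS = {"pas", "sans", "aucune"}
DROP_DRUGS = {"morphine", "tramadol"}
REJECT = BODY_FLUIDS | FAMILY | OTHER_DISEASES | NEGATIONS | DROP_DRUGS

def normalise(text: str) -> str:
    # Step 1: remove accents and special characters, collapse spaces, lower-case.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def is_gout(sentence: str) -> bool:
    words = normalise(sentence).split()
    found = False
    for i, w in enumerate(words):
        if not w.startswith("goutte"):
            continue
        found = True
        # Step 2: context of eight words before and five after the word gout.
        context = set(words[max(0, i - 8):i + 6])
        # Step 3: any firing test rejects the gout-diagnosis reading.
        if context & REJECT:
            return False
    return found

print(is_gout("Antécédents: goutte tophacée."))   # the disease
print(is_gout("Reçoit 10 gouttes de morphine."))  # droplets of a drug
```

The design mirrors the combination rule above: the sentence counts as gout only if no rejection test fires in any context window.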
Few-shot prompting of Llama 3
We used an 8-bit quantised version of the Meta-Llama-3-8B-Instruct model. Meta-Llama-3-8B is the smallest model of the Meta Llama 3 family of large language models. The Instruct version has been tuned and optimised for dialogue use cases. The 8-bit quantised version allowed us to use an easily accessible GPU, such as an NVIDIA Titan X with 12 GB of VRAM, in conjunction with a standard CPU with 20 GB of RAM.
The prompt (the entire prompt can be found at https://gitlab.unige.ch/goutte/llm_detection_of_diagnosis) consisted of the following steps:
The description of the role of the algorithm (‘you are a text classifier aiming at identifying gout diagnosis’).
A paragraph of context: what gout is, and in what contexts the word gout can be used (ie, to describe the disease, a drop of a medication or a drop of a body fluid such as blood).
Description of the two categories considered (positive gout diagnosis, or other).
The structure of the expected response with a simple chain of thought structure13:
A short explanation detailing the reasoning.
The result category, based on the short explanation, after the string ‘A:’.
A set of 10 examples with the desired output (a technique called few-shot prompting).
The output of the LLM was then parsed following the expected structure.
The parameter set used was a temperature of 0.3, a repetition penalty of 1, and sampling restricted by cumulative probability and number of most likely next words, with top_p=0.95 and top_k=40.
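The prompt assembly and output parsing can be sketched as below. The strings, examples and helper names are hypothetical condensations of the actual prompt (available in the repository); only the generation parameters match those listed above:

```python
# Hypothetical condensed prompt pieces; the full prompt is in the repository.
ROLE = "You are a text classifier aiming at identifying gout diagnosis."
CONTEXT = ("In French, 'goutte' may denote the disease gout, a drop of a "
           "medication or a drop of a body fluid such as blood.")
CATEGORIES = "Classify the sentence as 'gout diagnosis' or 'other'."
FORMAT = ("Give a short explanation of your reasoning, then the category "
          "after the string 'A:'.")
EXAMPLES = [  # few-shot examples (the study used 10; 2 shown here)
    ("Patient connu pour goutte tophacee.", "gout diagnosis"),
    ("Recoit 10 gouttes de morphine.", "other"),
]

def build_prompt(sentence: str) -> str:
    # Assemble role, context, categories, response format and few-shot examples.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return "\n\n".join([ROLE, CONTEXT, CATEGORIES, FORMAT, shots, f"Q: {sentence}"])

# Generation parameters as reported above, passed to the model's generate call.
GEN_KWARGS = {"temperature": 0.3, "repetition_penalty": 1.0, "top_k": 40, "top_p": 0.95}

def parse_answer(output: str):
    # The category is expected after the last 'A:'; None if the format broke.
    head, sep, tail = output.rpartition("A:")
    return tail.strip().splitlines()[0].strip() if sep and tail.strip() else None

print(parse_answer("The sentence describes tophaceous gout.\nA: gout diagnosis"))
```

Parsing failures (None) correspond to the 'outputs without result' counted in the sensitivity analysis below.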
Sensitivity analysis
The effects of the temperature and penalty parameters were tested by varying the temperature from 0.1 to 0.8 and the penalty parameter from 0.8 to 1.4. The impact of the prompt was tested by iteratively removing the steps described previously, or by increasing the number of result categories to four (positive gout diagnosis, negative gout diagnosis, liquid or other). The stability of the answers for the standard parameter set was assessed by performing five successive classifications of each sentence.
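The parameter sweep can be sketched as a simple grid loop. Here `classify` is a hypothetical stand-in for the full prompt-and-parse LLM call, returning a category or None when the output cannot be parsed:

```python
import itertools

def sweep(classify, sentences, labels, temperatures, penalties):
    """Grid over generation parameters, recording accuracy and the number
    of outputs that could not be parsed into a category."""
    results = {}
    for t, p in itertools.product(temperatures, penalties):
        preds = [classify(s, temperature=t, repetition_penalty=p) for s in sentences]
        # Score only the outputs that were parsable.
        scored = [(pr, y) for pr, y in zip(preds, labels) if pr is not None]
        acc = sum(pr == y for pr, y in scored) / len(scored) if scored else float("nan")
        results[(t, p)] = {"accuracy": acc, "unparsable": preds.count(None)}
    return results
```

Tracking unparsable outputs separately matters here, since high penalty values degraded the output format rather than the classifications themselves.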
Statistics
We summarised data using frequencies and percentages for categorical variables and medians and IQRs for continuous variables. Positive predictive value (PPV) was calculated as the proportion of documents truly referring to the disease gout among all documents classified by the model as referring to the disease. Negative predictive value (NPV) was calculated as the proportion of documents truly not referring to the disease among all documents classified by the model as not referring to it. Accuracy was defined as the number of outcomes correct according to the gold standard, divided by the total number of documents tested.
CIs were computed using the Wilson method.14 All statistics were computed using the software R (R Core Team, 2024, Vienna, Austria) V.4.2.0.15
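For k successes in n trials with point estimate p = k/n, the Wilson interval is (p + z²/(2n) ± z·sqrt(p(1−p)/n + z²/(4n²))) / (1 + z²/n). A stdlib-only sketch, with counts chosen to be consistent with the reported overall accuracy (the exact counts are illustrative):

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion of k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Illustrative counts consistent with the reported overall accuracy for gout.
lo, hi = wilson_ci(722, 757)
print(f"{722/757:.1%} ({lo:.1%}-{hi:.1%})")
```

With k=722 and n=757 this reproduces the reported accuracy of 95.4% (93.6%–96.7%).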
Ethical consideration
The use of the gout register8 data for quality improvement programmes has been approved by the Geneva ethics commission (CCER 2023-00129). The need for consent was waived by the Geneva Ethics Committee because this study qualifies as a quality improvement initiative.
Results
In the testing dataset, we analysed 757 sentences containing the word gout in French, of which 235 (31%) indicated that the patient had the disease gout (according to the manually reviewed charts; see table 1). In the validation dataset, we analysed 600 sentences, of which 376 (62%) indicated a positive diagnosis of CPPD.
Table 1. Characteristics of the testing and validation datasets.
| | Disease | No disease | P value |
| Testing dataset: Gout | |||
| Number of sentences | 235 | 522 | |
| Number of words (median (IQR)) | 59.00 (36.00–72.00) | 62.00 (42.00–71.00) | 0.261 |
| Number of occurrences of the word gout (%) | | | 0.797 |
| 1 | 187 (79.6) | 409 (78.4) | |
| 2 | 40 (17.0) | 98 (18.8) | |
| 3+ | 8 (3.4) | 15 (2.9) | |
| Validation dataset: calcium pyrophosphate deposition disease (CPPD) | |||
| Number of sentences | 376 | 224 | |
| Number of words (median (IQR)) | 59.50 (37.00–71.00) | 59.00 (35.75–69.00) | 0.585 |
| Number of occurrences of CPPD-related words (%) | | | <0.001 |
| 1 | 41 (10.9) | 20 (8.9) | |
| 2 | 250 (66.5) | 192 (85.7) | |
| 3+ | 69 (18.4) | 11 (4.9) | |
Algorithm performance
The LLM-based algorithm tended to perform better than the regex-based one (figure 1), reaching a similar PPV (92.7% (88.7%–95.4%) compared with 92.3% (87.9%–95.2%)) but a slightly higher NPV (96.6% (94.6%–97.8%) compared with 92.3% (89.8%–94.3%)). The LLM algorithm had an overall accuracy of 95.4% (93.6%–96.7%), slightly higher than the 92.3% (90.2%–94.0%) accuracy of the regex-based algorithm. The accuracy of both algorithms was slightly higher in the calibration sample (97.8% (96%–98.7%) for the Llama 3 algorithm and 96.8% (95.5%–98.4%) for the regex-based algorithm) than in the validation set. The regex-based algorithm took a few seconds to run on the 757 sentences on a standard laptop, while the Llama 3 algorithm took an hour and a half using a CPU with 20 GB of RAM and a 12 GB VRAM GPU.
Figure 1. Confusion matrix of the two algorithms, compared with the gold standard.
Sensitivity analysis
The results of the LLM-based algorithm were robust against parameter changes. Similar performance was obtained for temperatures ranging from 0.1 to 0.8 and penalty parameters from 0.8 to 1.3. A penalty parameter of 1.4 caused the LLM to produce non-structured output that could not be parsed (see online supplemental tables 1 and 2). Concerning the effect of the prompt (table 2):
Table 2. Sensitivity analysis of Llama 3 model with varying prompts.
| Prompt type | Outputs without result | Positive predictive value | Negative predictive value | Accuracy |
| Reference (few shot with context, chain of thoughts and two output categories) | 0 | 92.7% (88.7%–95.4%) | 96.6% (94.6%–97.8%) | 95.4% (93.6%–96.7%) |
| Few shot with context, chain of thoughts and four categories output | 0 | 90.2% (85.8%–93.3%) | 97.3% (95.5%–98.4%) | 95% (93.2%–96.3%) |
| Few shot with context, no chain of thoughts and four output categories | 0 | 87% (82.2%–90.6%) | 95.9% (93.8%–97.3%) | 93% (91%–94.6%) |
| One shot with context, chain of thoughts and four output categories | 0 | 81.5% (76.5%–85.7%) | 97.9% (96.2%–98.9%) | 91.9% (89.8%–93.7%) |
| Zero shot with context, chain of thoughts and four output categories | 13 | 84.6% (79.7%–88.6%) | 97.6% (95.8%–98.6%) | 93.2% (91.1%–94.8%) |
| Zero shot with no context, chain of thoughts and four output categories | 65 | 63.5% (58.3%–68.4%) | 99.4% (97.9%–99.8%) | 81.5% (78.4%–84.2%) |
| Zero shot with context, no chain of thoughts and four output categories | 0 | 89% (83.9%–92.6%) | 89.8% (87%–92%) | 89.6% (87.2%–91.5%) |
Adding classification categories did not result in better predictions and even slightly lowered the PPV (90.2% (85.8%–93.3%) with four categories).
A one-shot approach (only one example) or a zero-shot approach (no examples) produced lower accuracy, mainly due to a lower positive predictive value: the one-shot prompt yielded a PPV of 81.5% (76.5%–85.7%) and the zero-shot prompt a PPV of 84.6% (79.7%–88.6%). Of note, the zero-shot prompts produced several outputs that did not respect the expected formatting, resulting in 65 unusable outputs.
For the zero-shot prompt:
Removing the context strongly lowered the PPV from 84.6% (79.7%–88.6%) to 63.5% (58.3%–68.4%).
Removing the chain of thought, although allowing more robust output format, resulted in a lower accuracy: from 93.2% (91.1%–94.8%) to 89.6% (87.2%–91.5%).
When testing five inferences for each sentence, the result proved to be the same for the five inferences in 98% of the cases. The classification of eight sentences changed in one of the five inferences, and the classification of 11 sentences changed in two of the five inferences.
Validation
Reusing the LLM with the same parameters and the same prompt structure on a different disease (CPPD) yielded a PPV of 92.3% (88.4%–95.5%) for detecting the presence of a CPPD diagnosis, an NPV of 95.9% (91.4%–98.1%) and an overall accuracy of 94.1% (90.2%–97.6%).
Discussion
In this study, Llama 3, a recent LLM, showed excellent positive and negative predictive values in identifying gout diagnoses from unstructured (ie, free-text) medical documents of the EHR of a French-speaking tertiary university hospital. Performance was slightly better than that of a regex-based algorithm. The tested prompt structure appears to be a promising template for accurately detecting specific diseases, facilitating the creation of fast and easy-to-implement registers.
Previous studies have used natural language processing techniques, with or without machine learning, to detect gout flares, but prior to or without the use of recent LLMs.16 17 A recent study using a protected health information-compliant form of GPT-4 to extract hepatological imaging report data from an EHR showed a similar accuracy,6 while the use of an LLM to classify injuries showed close-to-perfect classification capabilities.7
The technique based on regular expressions performed well but required iterative adaptations of the different tests to appropriately reject false positive outcomes. For such a technique, the transposition to another disease will be language and context dependent, and will thus remain time-consuming. Although the LLM technique also needed some iterative work to adapt the prompt and the temperature parameter to obtain proper outputs, the sensitivity study showed that it is robust over a large range of parameter values. Our validation of the proposed framework on another disease diagnosis confirms its versatility. It may allow easy implementation in other languages or for other conditions without expensive tuning. The fact that the LLM algorithm performed very well using only a two-category output, that is, without the need to describe the different situations that could lead to false positives, is clearly an advantage. The slightly lower PPV obtained for the validation dataset compared with the testing one can be explained by the fact that CPPD in our EHR documents was frequently evoked in differential diagnosis lists, a situation in which the gold standard did not consider it a diagnosis, but the LLM tended to do so.
The computing power needed for the LLM was higher than for the regex-based method, which needed only a personal computer. Our computational setup, using an NVIDIA Titan X GPU with 12 GB of VRAM and 20 GB of RAM, represents an accessible hardware configuration, available both in research and in healthcare settings, thus allowing LLMs to be deployed locally even in institutions with limited computational infrastructure. This emphasises the need for hospitals to provide secure computing power to researchers and clinicians to foster research in free-text documents. Our study suggests that lighter LLMs, which do not require extensive computing resources, may be of significant interest for such applications. Given the sensitive nature of patient documents, local deployment is compulsory to maintain control over data flow and comply with institutional privacy protocols. The use of a light LLM such as Llama-3-8B allows the model to run locally on modest and easily available hardware.
An additional value of our approach lies in the potential to link clinical data from EHRs to large healthcare claims databases, which commonly use ICD-10 codes for disease classification. Diagnoses of rheumatologic diseases such as CPPD are particularly challenging due to potential diagnostic uncertainties, coding inconsistencies and limitations in phenotype specificity. The ICD-10 codes often only refer to chondrocalcinosis rather than specifically identifying CPPD or its acute forms, making it difficult to capture the full clinical spectrum of the disease.18 Our findings from the development of a gout register highlight that diagnostic codes may not always correspond to confirmed clinical diagnoses, often due to misclassification or coding errors.8 This underscores the importance of linking clinical records with claims data to enhance diagnostic accuracy and improve the validity of epidemiological studies in this field.
There are some limitations to our study. First, our methods were tested in a single academic institution, though it covers all medical specialties. Second, the study was conducted using only one language (French). Llama LLMs are known to perform well in English, French, German and Spanish,19 but results may change strongly for lower-resource languages. Third, at this stage, we only identified diagnoses, without identifying specific situations (ie, gout flare, chronic gouty arthritis, etc). Furthermore, ACR/EULAR classification criteria are now available for both gout and, as of 2023, CPPD.20 21 These criteria have become essential entry points for clinical trials.22 While our model can identify patients with gout and CPPD in unstructured documents, it requires additional data sources and analyses to accurately assess all criteria elements, such as the number of flares, types of joints involved, characteristics of symptomatic episodes and radiographic features. Finally, the 90 min processing time required to run the LLM on our dataset may restrict the feasibility of frequent updates. While applications such as register development, patient recruitment for clinical trials and assessment of clinical indicators generally do not demand real-time data, the current computational demands of LLMs might limit their use in settings requiring faster turnaround.
In conclusion, LLMs can accurately detect disease diagnoses in the clinical documents of EHRs, even when the disease name closely resembles other common words. Our study proposes a framework that can be easily reused and suggests that LLMs can perform well even in languages outside their primary training dataset, paving the way for detecting diseases with minimal effort from the rich clinical notes and documents of a hospital EHR. While our results demonstrate robustness within the same model, further validation and comparison with other LLMs or NLP techniques may be of interest to assess potential additional advantages and limitations of our approach. Additionally, future work should consider evaluating diseases with more complex symptom profiles or improving the accuracy of gout and CPPD identification through the application of validated classification criteria. Combined with appropriate resources from hospitals, the template proposed in this study could help link clinical records with claims data to enhance diagnostic accuracy, significantly accelerating the creation of registers and the detection of patients for clinical trial recruitment, reducing a typically months-long process to just days.
Supplementary material
Acknowledgements
We thank ED at the Information Systems Directorate for his help in accessing the database of the Geneva University Hospitals, as well as support for the implementation of the large language model.
Footnotes
Funding: This project was funded by the Private Foundation of the Geneva University Hospitals, a not-for-profit foundation.
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient consent for publication: Not applicable.
Data availability free text: All prompts and code have been made available at the following gitlab repository: https://gitlab.unige.ch/goutte/llm_detection_of_diagnosis. Due to medical confidentiality, we are unable to share the sentences and document data. However, if authorisation is obtained from the ethics committee, we may be able to provide access to the data.
Contributor Information
Nils Bürgisser, Email: nburgisser@proton.me.
Etienne Chalot, Email: Etienne.Chalot@hug.ch.
Samia Mehouachi, Email: samia.mehouachi@hcuge.ch.
Clement P. Buclin, Email: clement.buclin@hcuge.ch.
Kim Lauper, Email: Kim.lauper@hcuge.ch.
Delphine S. Courvoisier, Email: Delphine.courvoisier@hcuge.ch.
Denis Mongin, Email: denis.mongin@hcuge.ch.
Data availability statement
Data are available in a public, open access repository. Data are available upon reasonable request.
References
1. Devlin J, Chang MW, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2019. doi: 10.48550/arXiv.1810.04805.
2. Fan L, Li L, Ma Z, et al. A Bibliometric Review of Large Language Models Research from 2017 to 2023. arXiv. 2023. doi: 10.48550/arXiv.2304.02020.
3. Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board-Style Examination. JAMA Netw Open. 2023;6:e2346721. doi: 10.1001/jamanetworkopen.2023.46721.
4. Bernstein IA, Zhang YV, Govil D, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open. 2023;6:e2330320. doi: 10.1001/jamanetworkopen.2023.30320.
5. Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194. doi: 10.1038/s41746-022-00742-2.
6. Ge J, Li M, Delk MB, et al. A Comparison of a Large Language Model vs Manual Chart Review for the Extraction of Data Elements From the Electronic Health Record. Gastroenterology. 2024;166:707–9. doi: 10.1053/j.gastro.2023.12.019.
7. Lorenzoni G, Gregori D, Bressan S, et al. Use of a Large Language Model to Identify and Classify Injuries With Free-Text Emergency Department Data. JAMA Netw Open. 2024;7:e2413208. doi: 10.1001/jamanetworkopen.2024.13208.
8. Bürgisser N, Mongin D, Mehouachi S, et al. Development and validation of a self-updating gout register from electronic health records data. RMD Open. 2024;10:e004120. doi: 10.1136/rmdopen-2024-004120.
9. Naveed H, Khan AU, Qiu S, et al. A comprehensive overview of large language models. arXiv. 2024. doi: 10.48550/arXiv.2307.06435.
10. Touvron H, Lavril T, Izacard G, et al. LLaMA: open and efficient foundation language models. arXiv. 2023. doi: 10.48550/arXiv.2302.13971.
11. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. In: Advances in Neural Information Processing Systems. Curran Associates, Inc; 2020:1877–901.
12. Perlis RH, Fihn SD. Evaluating the Application of Large Language Models in Clinical Research Contexts. JAMA Netw Open. 2023;6:e2335924. doi: 10.1001/jamanetworkopen.2023.35924.
13. Sahoo P, Singh AK, Saha S, et al. A systematic survey of prompt engineering in large language models: techniques and applications. arXiv. 2024. doi: 10.48550/arXiv.2402.07927.
14. Wilson EB. Probable Inference, the Law of Succession, and Statistical Inference. J Am Stat Assoc. 1927;22:209. doi: 10.2307/2276774.
15. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021. https://www.R-project.org/
16. Zheng C, Rashid N, Wu Y, et al. Using Natural Language Processing and Machine Learning to Identify Gout Flares From Electronic Clinical Notes. Arthritis Care Res (Hoboken). 2014;66:1740–8. doi: 10.1002/acr.22324.
17. Osborne JD, Booth JS, O'Leary T, et al. Identification of Gout Flares in Chief Complaint Text Using Natural Language Processing. AMIA Annu Symp Proc. 2020:973–82.
18. Tedeschi SK. Issues in CPPD Nomenclature and Classification. Curr Rheumatol Rep. 2019;21:49. doi: 10.1007/s11926-019-0847-4.
19. Li Z, Shi Y, Liu Z, et al. Quantifying multilingual performance of large language models across languages. arXiv. 2024. doi: 10.48550/arXiv.2404.11553.
20. Neogi T, Jansen TLTA, Dalbeth N, et al. 2015 Gout Classification Criteria: an American College of Rheumatology/European League Against Rheumatism collaborative initiative. Arthritis Rheumatol. 2015;67:2557–68. doi: 10.1002/art.39254.
21. Abhishek A, Tedeschi SK, Pascart T, et al. The 2023 American College of Rheumatology/European Alliance of Associations for Rheumatology Classification Criteria for Calcium Pyrophosphate Deposition (CPPD) Disease. Arthritis Rheumatol. 2023;75:1703–13. doi: 10.1002/art.42619.
22. Tedeschi SK. A New Era for Calcium Pyrophosphate Deposition Disease Research: The First-Ever Calcium Pyrophosphate Deposition Disease Classification Criteria and Considerations for Measuring Outcomes in Calcium Pyrophosphate Deposition Disease. GUCDD. 2024;2:52–9. doi: 10.3390/gucdd2010005.

