Clinical Information Extraction with Large Language Models: A Case Study on Organ Procurement

Hammaad Adam; Junjing Lin; Jianchang Lin; Hillary Keenan; Ashia Wilson; Marzyeh Ghassemi

2025 May 22;2024:115–123.

PMCID: PMC12099322 PMID: 40417525

Abstract

Recent work has demonstrated that large language models (LLMs) are powerful tools for clinical information extraction from unstructured text. However, existing approaches have largely ignored the extraction of numeric information such as laboratory tests and vital signs. In this article, we present a case study on organ procurement that evaluates the ability of LLMs to extract numeric data from clinical text. We first describe our LLM-based approach, introducing a prompting strategy for numeric extraction and novel heuristics to combat hallucination. We validate our approach on a hand-annotated set of 298 notes, demonstrating that it has high accuracy, precision and recall. We then highlight the value of our approach for downstream data analysis using a corpus of 43,719 notes on 14,342 potential organ donors. This case study is a key component of an ongoing collaboration that aims to make data on organ procurement publicly available for informatics research.

Introduction

Incomplete data is a key challenge for machine learning and statistical inference with electronic health records (EHRs). Data that could be stored in structured fields–such as laboratory tests, vital signs, and diagnoses–is often only contained in unstructured, free-text clinical notes.¹,² Extracting structured information from unstructured notes is challenging. Many current extraction approaches require crafting custom regular expressions,³ which can be tedious and application-specific. Other approaches based on machine learning require a large number of labeled examples⁴ and can be computationally expensive to train.

Large language models (LLMs) offer a potential solution to this problem. Recent work has demonstrated that LLMs can be used to extract clinical information from free-text notes without being explicitly trained to do so.⁵,⁶ For example, one article found that an LLM (InstructGPT⁷) was able to accurately extract biomedical evidence, medications, and other useful information from clinical text.⁶ Such work is promising, as these methods require no additional model training and need very few labeled examples. However, existing LLM approaches have largely focused on the extraction of non-numeric information. No work yet has systematically analyzed the performance of LLMs at extracting numeric values (e.g., vital signs, laboratory tests) from clinical text.

In this paper, we present a novel case study that demonstrates the ability of LLMs to extract numeric clinical information. Our work focuses on organ procurement, a vital process in the overall system of organ transplantation. In the United States (US), organ procurement–the process by which organs are recovered from deceased donors–is managed regionally by organ procurement organizations (OPOs). OPOs make several important medical decisions, including evaluating the suitability of potential donors to provide transplant-viable organs. Analyzing administrative data collected by OPOs can yield valuable insights into these decisions and the efficiency of the procurement process. However, like many other clinical databases, OPO data is incomplete. Many critical variables (e.g., lab results) are inconsistently noted in structured fields and are more commonly found in free-text case notes made by OPO staff.

Here, we showcase a novel LLM approach that extracts vital signs and lab results from OPO notes. This study is a key component of an ongoing collaboration that aims to make OPO data publicly available for research.⁸,⁹ We used Llama-2 7B¹⁰–a relatively small, open-source LLM–to extract eight donation-relevant measurements: blood urea nitrogen (BUN), serum creatinine, aspartate aminotransferase (AST), alanine aminotransferase (ALT), total bilirubin (Tbili), systolic blood pressure (BP), diastolic BP, and heart rate (HR). We used a set of 298 annotated notes to demonstrate that our approach accurately extracts these variables with access to just five labeled examples and no additional training. Our approach had both high precision and recall, outperforming a rules-based method designed specifically for this task. We then highlight the value of the extracted information using a corpus of 43,719 notes on 14,342 potential donors, demonstrating that LLM-extracted measurements can improve downstream data analysis.

Our work differs from previous LLM approaches for clinical information extraction in three key ways. First, we focused specifically on the extraction of numeric information. Second, we used a relatively small, open-source LLM. Many current LLM approaches can be challenging to deploy in real-world medical settings due to privacy concerns or computational limits. For example, InstructGPT is a closed-source model that may be risky to use with protected health information (PHI), as it requires transmitting data through an Application Programming Interface (API). In contrast, our work used Llama-2 7B, an open-source model that could be run on a single graphics processing unit (GPU) with less than 25 gigabytes (GB) of random access memory (RAM). Third, we specifically addressed the problem of “hallucination,” a frequently observed phenomenon in which LLMs output plausible yet incorrect information. We implemented three heuristics to combat hallucination that improved extraction accuracy.

Methods

Data

We used a large corpus of 43,719 notes on 14,342 potential donors evaluated by OPOs for their suitability for organ donation. These notes contained detailed information on the potential donor’s hospital admission, diagnostic tests, brain death testing, and family interactions. The notes were obtained along with structured clinical data as part of a collaboration between researchers at the Massachusetts Institute of Technology (MIT) and six OPOs that aims to make organ procurement process data publicly available for research.⁹ As the data contained identifying information, it was stored on a server at MIT that was compliant with the Health Insurance Portability and Accountability Act (HIPAA). The transfer, storage, and use of this data was approved by the MIT Committee on the Use of Humans as Experimental Subjects (protocol 2201000540A001).

Extraction Methodology

Task. Our work focused on extracting eight numeric variables from the OPO notes: BUN, creatinine, AST, ALT, Tbili, systolic blood pressure, diastolic blood pressure, and heart rate. These variables are all crucial in assessing the suitability of a potential donor for transplant. BUN and creatinine are measures of kidney function,¹¹ AST, ALT, and Tbili are measures of liver function,¹² while blood pressure and heart rate are vital signs that are crucial to monitor during donor management.¹³ However, these values are noted extremely inconsistently in the OPO notes. For example, many different abbreviations are used for creatinine, including “crea”, “cr”, “creat”, and “creatine.” Different measurements are also listed in different ways; for example, measurements of BUN and creatinine may be listed sequentially (e.g., “BUN 10 crea 1.1”) or together (e.g., “BUN / Crea 10 / 1.1”). These inconsistencies made this extraction challenging for rules-based approaches that rely on regular expressions. Moreover, many machine learning approaches were not applicable to our setting due to privacy and compute limitations. For example, numeric information extraction can be framed as a named entity recognition (NER) task; however, fine-tuning existing machine learning NER methods (e.g., NeuroNER¹⁴) was infeasible due to the lack of labeled examples and limited computational capacity of the PHI-compliant server on which the OPO notes were stored. As a result, we were constrained to approaches that required almost no labeled examples and no additional model training.

LLM approach. To overcome the challenges described above, we developed an LLM-based approach for extracting numeric values from clinical text. Our approach used the Llama-2 model, an open-source pre-trained LLM from Meta.¹⁰ Llama-2 is a prompt-based generative model that generates free text in response to a provided free-text input (i.e., a prompt). Our approach provided Llama-2 with a tailored prompt that primed it to generate text summarizing the eight clinical variables contained in a given OPO note. Figure 1 provides an example of a prompt used in our extraction approach. The prompt consisted of an instruction to the model, an example of a successful extraction, and the OPO note that was the target for extraction. Providing an example of a successful extraction is important, as it primed Llama-2 to perform the desired task and output text in the desired format (i.e., JSON). The generated text was then converted to tabular form with minimal post-processing, yielding the extracted numeric values.

Figure 1. An example of our LLM extraction approach. The prompt provided to the LLM contains an instruction, an example of a successful extraction, and an input note. In response, the LLM generates text in JSON format that summarizes the eight relevant variables present in the input note. Note that this figure is a toy example used for exposition, not an actual generation from the LLM.

Our approach is an application of in-context learning,¹⁵ where an LLM can learn to perform a task by analogy (i.e., from an example provided in the prompt) instead of further training on labeled data. While the prompt in Figure 1 provides only one example to the LLM, we consider approaches that provide more examples (or “shots”) in our experiments. We varied the number of examples provided between one (i.e., “one-shot”) and five (“five-shot”). To obtain these five examples, we adapted notes from the OPO dataset. We used real notes (i.e., as written by OPO staff) with two alterations. First, we shortened the examples to only include the portion of the note that contained the relevant numeric values. This greatly reduced the length of the input prompt, reducing the time taken for the extraction. Second, we replaced the actual numeric values with rare values (e.g., replacing “Creatinine 1.4” with “Creatinine 10.4”). We elaborate on the rationale for this modification in the following section on hallucination.
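The prompt structure described above (instruction, worked examples, target note) and the conversion of generated JSON into a tabular row can be sketched as follows. This is a minimal illustration, not the prompt used in the study: the instruction wording, the `Note:`/`Output:` layout, and the function names `build_prompt` and `parse_generation` are assumptions, and the call to the model itself is left out.

```python
import json
import re

VARIABLES = ["BUN", "Creatinine", "AST", "ALT", "Tbili", "SystolicBP", "DiastolicBP", "HR"]

# Hypothetical instruction wording; the source does not reproduce its exact prompt text.
INSTRUCTION = (
    "Extract these measurements from the note and answer in JSON, "
    "using null for anything that is not mentioned: " + ", ".join(VARIABLES) + "."
)


def build_prompt(examples, target_note):
    """Join the instruction, k (note, values) examples, and the target note."""
    parts = [INSTRUCTION]
    for note, values in examples:
        answer = {name: values.get(name) for name in VARIABLES}
        parts.append(f"Note: {note}\nOutput: {json.dumps(answer)}")
    parts.append(f"Note: {target_note}\nOutput:")
    return "\n\n".join(parts)


def parse_generation(text):
    """Read the first flat JSON object in the generated text into one row."""
    row = {name: None for name in VARIABLES}
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return row
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return row
    row.update({name: parsed.get(name) for name in VARIABLES})
    return row


# One-shot prompt whose example uses rare values (BUN 212, creatinine 9.9).
one_shot = [("BUN 212 crea 9.9", {"BUN": 212, "Creatinine": 9.9})]
prompt = build_prompt(one_shot, "BUN 10 crea 1.1")
row = parse_generation('{"BUN": 10, "Creatinine": 1.1, "HR": null}')
```

Passing more `(note, values)` pairs to `build_prompt` gives the two-shot through five-shot variants; unparseable generations simply yield a row of nulls.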

We make two quick notes on our model choice. First, we used the “Chat-HF” version of Llama-2; this variant was specifically trained to generate text that conforms to user-specified instructions, thus being the best fit for our extraction task. Second, our work used the smallest version of Llama-2 (i.e., Llama-2 7B), which fits on a single GPU with under 25 GB of RAM when loaded in half-precision floating point format. This choice was dictated by computational constraints: the PHI-compliant server on which the OPO notes were stored did not have sufficient RAM to host larger models. Using larger models would likely have provided better results; however, our work demonstrates that our approach can be effective even in settings with computational constraints.
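A back-of-the-envelope calculation shows why half precision matters for the 25 GB budget. The sketch below assumes a nominal count of 7 billion parameters and counts model weights only; activations and the attention cache add to the total, so these figures are lower bounds rather than measured usage.

```python
PARAMETERS = 7e9            # nominal "7B" parameter count (assumption)
BYTES_PER_PARAM_FP16 = 2    # half-precision floating point
BYTES_PER_PARAM_FP32 = 4    # single-precision floating point

weights_gb_fp16 = PARAMETERS * BYTES_PER_PARAM_FP16 / 1e9
weights_gb_fp32 = PARAMETERS * BYTES_PER_PARAM_FP32 / 1e9

print(f"fp16 weights: {weights_gb_fp16:.0f} GB, fp32 weights: {weights_gb_fp32:.0f} GB")
```

Under these assumptions the half-precision weights take about 14 GB, which leaves headroom below 25 GB, whereas single-precision weights alone would take about 28 GB.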

Hallucination heuristics. A frequently observed problem with LLMs is their tendency to “hallucinate,” that is, generate plausible but incorrect information.¹⁶ Our LLM approach detailed in the previous section occasionally exhibited such behavior; for example, the LLM sometimes output a common value of 1.5 for creatinine, even if no mention of creatinine was made in the note. We thus implemented three heuristics that detected hallucinations and increased the accuracy of our LLM extractions. Note that these heuristics are applied as a post-processing step to the raw text generated by the LLM.

Heuristic 1: Cross-check original note. We used regular expressions to check if every value output by the LLM was present in the source note. For example, if the LLM extracted a BUN value of 16, we checked if the number 16 was present in the original text. If it was not, we replaced the extracted value with null.
Heuristic 2: Remove values from examples in the prompt. We observed a tendency for the LLM to output values from the examples in the prompt if no value was specified in the input note. We thus introduced a second heuristic which replaced extracted values with nulls if they were present in one of the five example notes. For example, in the prompt in Figure 1, the example provided to the LLM had a BUN value of 212. Thus, if the LLM predicted a BUN of 212 for the target note, we replaced it with null. Note that this heuristic may remove accurate extractions if the example and input note have the same value for a given variable. To minimize this risk, we included rare values of each variable in the example notes. For example, the prompt in Figure 1 had an example BUN of 212 and a creatinine of 9.9, both of which are rare. This heuristic was thus unlikely to remove values that were accurately extracted from the input note.
Heuristic 3: Remove implausible extractions. Finally, we removed any extracted values that fell outside plausible ranges for the corresponding variables. For example, a creatinine value of 30 is extremely rare and can likely be discarded as an inaccurate extraction. We only included LLM-extracted values that fell in the following ranges (not including border values): 0-200 for BUN, 0-20 for creatinine, greater than 0 for AST and ALT, 0-100 for Tbili, 20-200 for both blood pressures, and 15-200 for heart rate.
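
The three heuristics can be sketched as a single post-processing function. The plausible ranges are those listed in Heuristic 3 (exclusive bounds); everything else is an assumption made for illustration: the function names, the dictionary keys, the number-matching regular expression, and the reading of Heuristic 2 as "any number that appears in the text of an example note".

```python
import re

# Exclusive plausible ranges from Heuristic 3; AST and ALT only need to exceed 0.
PLAUSIBLE_RANGES = {
    "BUN": (0, 200),
    "Creatinine": (0, 20),
    "AST": (0, float("inf")),
    "ALT": (0, float("inf")),
    "Tbili": (0, 100),
    "SystolicBP": (20, 200),
    "DiastolicBP": (20, 200),
    "HR": (15, 200),
}


def numbers_in(text):
    """Every integer or decimal literal in the text, as floats."""
    return {float(token) for token in re.findall(r"\d+(?:\.\d+)?", text)}


def apply_heuristics(extracted, note, example_notes):
    """Replace suspect extractions with None; keep the rest unchanged."""
    note_numbers = numbers_in(note)
    example_numbers = set().union(*(numbers_in(e) for e in example_notes))
    cleaned = {}
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = extracted.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        keep = (
            is_number
            and float(value) in note_numbers          # Heuristic 1: present in the source note
            and float(value) not in example_numbers   # Heuristic 2: not copied from an example
            and low < value < high                    # Heuristic 3: inside the plausible range
        )
        cleaned[name] = value if keep else None
    return cleaned


cleaned = apply_heuristics(
    {"BUN": 16, "Creatinine": 1.5, "HR": 88},
    note="BUN 16, HR 88, BP 120/80",
    example_notes=["BUN 212 crea 9.9"],
)
```

In this toy call the creatinine of 1.5 is dropped because the note never mentions it, while BUN and HR survive all three checks.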

Model Evaluation

We used an annotated subset of 298 notes to evaluate the performance of our LLM approach. These notes were hand-annotated by one of this article’s authors, who manually inspected each note and extracted the values of the eight relevant variables. The annotated notes were randomly selected from the overall corpus and were randomly split into a validation set and a testing set. These sets are described in further detail below. In both sets, we evaluated model performance using three metrics, reported separately for each of the eight variables.

Accuracy: how often did the value extracted by the LLM from an input note match the true value?
Precision: if the LLM extracted a value from an input note, how often was it correct?
Recall: if the input note contained a value, how often did the LLM correctly extract this value?
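
The three metrics above can be computed per variable as in the sketch below. It assumes that a missing value is represented as `None` and that a note where both the annotation and the extraction are `None` counts as a match for accuracy; the source does not spell out that convention, and the function name is hypothetical.

```python
def extraction_metrics(predicted, truth):
    """Accuracy, precision, and recall for one variable across a set of notes."""
    pairs = list(zip(predicted, truth))
    # Accuracy: the extracted value (or None) equals the annotated value (or None).
    accuracy = sum(p == t for p, t in pairs) / len(pairs)
    # Precision: among notes where something was extracted, how often it was right.
    extracted = [(p, t) for p, t in pairs if p is not None]
    precision = sum(p == t for p, t in extracted) / len(extracted) if extracted else None
    # Recall: among notes that contain a value, how often it was extracted correctly.
    present = [(p, t) for p, t in pairs if t is not None]
    recall = sum(p == t for p, t in present) / len(present) if present else None
    return accuracy, precision, recall


accuracy, precision, recall = extraction_metrics(
    predicted=[1.1, None, 2.0, None],
    truth=[1.1, None, 2.5, 3.0],
)
```

In this toy case two of four notes match (accuracy 0.5), one of two extractions is correct (precision 0.5), and one of three annotated values is recovered (recall 1/3).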

Validation set. The validation set consisted of 198 notes that were used to assess model performance during method development. A key part of the validation exercise was to determine how performance varied with the number of examples provided to the LLM in the prompt. We thus varied the number of examples provided between one and five. For a fixed number of examples, we ran the extraction five times, randomly selecting which of the five examples to provide to the LLM in each iteration. We report accuracy, precision, and recall averaged across these five iterations. We also used the validation set to assess the impact of the hallucination heuristics on precision.
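
The repeated-sampling protocol for a fixed number of shots can be sketched as follows. Here `run_extraction` is a hypothetical callable that runs the extraction with the chosen examples and returns one metric value; the seed and function name are likewise assumptions.

```python
import random
from statistics import mean


def evaluate_k_shot(example_pool, k, run_extraction, n_iterations=5, seed=0):
    """Average a metric over repeated random draws of k examples from the pool."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        shots = rng.sample(example_pool, k)   # k distinct examples, chosen at random
        scores.append(run_extraction(shots))
    return mean(scores)


# Stub metric for illustration only: the score is the fraction of the pool used.
pool = ["example 1", "example 2", "example 3", "example 4", "example 5"]
average_score = evaluate_k_shot(pool, k=3, run_extraction=lambda shots: len(shots) / 5)
```

With `k=5` every iteration uses the full pool, so only the ordering of the examples in the prompt can differ between iterations.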

Test set. Since certain modeling choices (e.g., the number of examples) were based on performance in the validation set, this could not be considered an evaluation on unseen data. We thus evaluated the performance of the LLM extraction on a separate test set of 100 examples that were not accessed by the authors during method development. Here, we focused only on the five-shot case (i.e., provided the LLM with five examples in the prompt).

Baseline. In both the validation and test sets, we compared the performance of the LLM approach to a rules-based approach that used regular expressions (regex). The regex were designed to detect patterns that resembled a variable name (or synonym), followed by a whitespace, followed by a number (e.g., “crea 1.6”, “cr 1.6”, “bun 212”). We conducted this rules-based extraction for each of the eight variables and report accuracy, precision, and recall.
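
A pattern of the kind described above (name or synonym, whitespace, number) can be sketched as follows. This is an illustrative reconstruction, not the baseline's actual expressions: only BUN and creatinine are shown, with the creatinine abbreviations mentioned earlier in the text, and the function names are assumptions.

```python
import re

# Synonyms taken from the abbreviations quoted in the text; other variables are analogous.
SYNONYMS = {
    "BUN": ["bun"],
    "Creatinine": ["creatinine", "creatine", "creat", "crea", "cr"],
}


def compile_pattern(names):
    """Match a variable name or synonym, then whitespace, then a number."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(rf"\b(?:{alternatives})\s+(\d+(?:\.\d+)?)", re.IGNORECASE)


PATTERNS = {variable: compile_pattern(names) for variable, names in SYNONYMS.items()}


def regex_extract(note):
    """Return the first matching value per variable, or None if nothing matches."""
    result = {}
    for variable, pattern in PATTERNS.items():
        match = pattern.search(note)
        result[variable] = float(match.group(1)) if match else None
    return result


sequential = regex_extract("BUN 10 crea 1.1")
paired = regex_extract("BUN/Cr 10/1.5")
```

The sequential note is handled as intended. In the paired note the name and its number are no longer adjacent, so this sketch finds no BUN at all (and would pair "Cr" with the wrong number), which illustrates why such formats are difficult for pattern matching.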

Downstream Value of Extracted Data

Finally, we demonstrate the value of the LLM extraction in a downstream data analysis task. In this analysis, we aimed to understand the drivers of an OPO’s decision to approach a potential donor’s family to initiate discussions about donation. This is one of the most important decisions made by an OPO, as donations typically do not proceed unless the patient’s family is approached. To maximize opportunities with its available resources, the OPO prioritizes approaching those donors that it determines have healthy enough organs to lead to successful transplants.

We estimated the effect of four clinical variables on the OPO’s decision to approach a potential donor’s family: BUN, creatinine, AST, and Tbili. We chose these variables as they convey function of the kidneys and liver, the two most commonly transplanted organ types.¹ We extracted these four variables from a corpus of 43,719 notes made on 14,342 potential donors from one OPO. We supplemented the LLM-extracted observations with structured data from the OPO’s database (hereafter referred to as “tabular data”). We excluded any patients who did not have at least one observed value for each of the four considered variables. If a potential donor had multiple observations for any variables, we included the mean of these observations in our analysis. We then fit a logistic regression to estimate the association between these four variables and the binary outcome of the potential donor’s family being approached, controlling for the patient’s age, cause of death, and whether they had suffered brain death. Note that AST was included as a log-transformed variable due to its heavy right-skew. We did not include the log of ALT, as it was highly correlated with AST (correlation coefficient: 0.86).
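
The data preparation steps described above (averaging repeated observations, excluding donors with incomplete data, log-transforming AST) can be sketched as below. The input layout of `(donor_id, variable, value)` tuples and all names are assumptions, as is taking the mean before the logarithm; the regression itself is omitted.

```python
import math
from collections import defaultdict
from statistics import mean

REQUIRED = ("BUN", "Creatinine", "AST", "Tbili")


def build_features(observations):
    """Collapse (donor_id, variable, value) observations into one row per donor."""
    by_donor = defaultdict(lambda: defaultdict(list))
    for donor_id, variable, value in observations:
        if variable in REQUIRED and value is not None:
            by_donor[donor_id][variable].append(value)

    features = {}
    for donor_id, values in by_donor.items():
        # Keep only donors with at least one observation of every required variable.
        if all(values[name] for name in REQUIRED):
            row = {name: mean(values[name]) for name in REQUIRED}
            row["log_AST"] = math.log(row.pop("AST"))  # assumes a strictly positive mean
            features[donor_id] = row
    return features


features = build_features([
    ("d1", "BUN", 10), ("d1", "BUN", 20), ("d1", "Creatinine", 1.0),
    ("d1", "AST", 100), ("d1", "Tbili", 0.5),
    ("d2", "BUN", 12),  # d2 lacks creatinine, AST, and Tbili, so it is excluded
])
```

Observations from the tabular data and from the note extractions can be passed in together, since each is just another tuple for the same donor.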

We report the estimated regression coefficient and standard error from the logistic regression. We then compare our results with two alternate approaches: 1) using only the tabular data (i.e., the data stored in structured fields), and 2) using the tabular data combined with measurements extracted from the notes by the regex baseline.

Results

Validation Set. Figure 2 describes the performance of our LLM extraction approach on the validation set. The LLM approach is extremely accurate, extracting every variable with greater than 90% accuracy, precision, and recall. All three metrics increased with the number of examples provided to the LLM. The five-shot LLM approach was particularly strong, outperforming the regex baseline on accuracy and recall for all eight variables. The disparity in recall was particularly stark for BUN and creatinine, where the LLM had over 90% recall (compared to ~70% for the regex baseline). One driver of this disparity was that BUN and creatinine were often noted together (e.g., “BUN/Cr 10/1.5”); the LLM was able to extract such values, though the regex baseline could not. Notably, the LLM approach was also nearly as precise as the regex baseline. This performance is impressive, as the regex baseline is an extremely precise approach by construction (as it looks for specific phrases in the note).

The hallucination heuristics proved to be a key driver of the LLM approach’s high precision. Figure 3 demonstrates the marginal contribution of each heuristic to the precision of the five-shot LLM approach. While the raw five-shot LLM output has greater than 90% precision for all variables, the hallucination heuristics were able to increase this precision to over 95%. The heuristics were particularly impactful for BUN, creatinine, and Tbili, which were more commonly hallucinated by the LLM than the other variables.

Test Set. Figure 4 demonstrates the strong performance of the five-shot LLM approach on the unseen test set. The LLM extractions displayed accuracies between 93 and 98%. The LLM greatly improved recall and accuracy over the regex baseline; for example, the LLM approach had a recall of 92% for creatinine, compared to 59% for the baseline. As in the validation set, the LLM approach also demonstrated high precision for all eight variables.

Downstream Value. We now summarize the value of the extracted numeric values for downstream data analysis. We first emphasize that the LLM extraction greatly increased the completeness of the data. Table 1 summarizes the number of potential donors who had at least one observation of each variable using 1) the tabular data only, 2) the tabular data with regex extractions, and 3) the tabular data with LLM extractions. The LLM extractions greatly improved data completeness; for example, using LLM extractions to supplement the tabular data provided creatinine measurements for an additional 8,064 potential donors (compared to 5,788 additions for the regex approach).

Table 1.

Number of potential donors with at least one extracted measurement for each variable.

Measurement	Tabular Data Only	Tabular Data + Regex	Tabular Data + LLM
BUN	3,984	8,252	11,007
Creatinine	3,986	9,774	12,050
AST	3,754	9,276	10,918
ALT	3,755	9,302	10,901
Tbili	3,737	8,969	10,059
HR	3,482	7,041	7,081
SystolicBP	3,478	7,735	8,009
DiastolicBP	3,478	7,413	7,731

This additional data had a direct impact on the data analysis task. Table 2 displays the estimated regression coefficients for BUN, creatinine, AST, and Tbili on the outcome of the OPO approaching a potential donor’s family. Fitting the regression using the tabular data alone, we were able to detect significant associations (at a 0.05 level) only for creatinine and AST. If we added in the data extracted by the regex approach, we were able to detect significant associations for BUN in addition to creatinine and AST. Finally, if we used the LLM extracted values, we were able to identify significant associations for all four variables (i.e., BUN, creatinine, AST, and Tbili). This finding aligns with clinical expertise, since all four of these values convey important information about kidney and liver viability that the OPO likely considers when deciding whether to approach a family. The LLM was thus able to increase the estimation precision of the data analysis by reducing the number of potential donors with missing data. Note, however, that this task is only intended to show the benefit of additional data completeness; using these estimates to inform real-world practice on organ procurement would require a more careful exposition.

Table 2.

Logistic regression coefficients for associations between the OPO’s decision to approach a potential donor’s family and four extracted variables (BUN, creatinine, AST, and Tbili). We display estimated coefficients with standard errors in parentheses. Results marked with an asterisk (bolded in the original table) were statistically significant at a 0.05 level.

	Tabular Data Only	Tabular Data + Regex	Tabular Data + LLM
BUN	0.00 (0.004)	-0.02 (0.003)*	-0.02 (0.003)*
Creatinine	-0.24 (0.041)*	-0.09 (0.022)*	-0.16 (0.031)*
AST	-0.14 (0.039)*	-0.11 (0.032)*	-0.08 (0.029)*
Tbili	0.02 (0.015)	0.00 (0.006)	-0.04 (0.017)*
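
The significance pattern described in the text can be checked approximately from the reported coefficients and standard errors. The sketch below assumes a two-sided Wald test (coefficient divided by standard error, compared with 1.96); the source does not name the test it used, and the coefficients are rounded to two decimals, so the recomputed statistics are only approximate.

```python
Z_CRITICAL = 1.96  # two-sided critical value of the standard normal at the 0.05 level

# (coefficient, standard error) pairs as reported in Table 2.
TABLE_2 = {
    "Tabular Data Only": {
        "BUN": (0.00, 0.004), "Creatinine": (-0.24, 0.041),
        "AST": (-0.14, 0.039), "Tbili": (0.02, 0.015),
    },
    "Tabular Data + Regex": {
        "BUN": (-0.02, 0.003), "Creatinine": (-0.09, 0.022),
        "AST": (-0.11, 0.032), "Tbili": (0.00, 0.006),
    },
    "Tabular Data + LLM": {
        "BUN": (-0.02, 0.003), "Creatinine": (-0.16, 0.031),
        "AST": (-0.08, 0.029), "Tbili": (-0.04, 0.017),
    },
}


def significant_variables(column):
    """Variables whose absolute Wald statistic exceeds the critical value."""
    return [name for name, (coef, se) in column.items() if abs(coef / se) > Z_CRITICAL]


for label, column in TABLE_2.items():
    print(label, significant_variables(column))
```

For example, Tbili with the LLM-supplemented data gives roughly -0.04 / 0.017 ≈ -2.35, which clears the threshold, whereas the tabular-only estimate gives roughly 0.02 / 0.015 ≈ 1.33, which does not.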

Discussion and Conclusions

We have shown how large language models can be used as powerful tools to extract numeric information from unstructured clinical text. Our proposed approach demonstrates high accuracy, precision, and recall on a hand-annotated set of notes, and improves estimation precision in a downstream data analysis task. While LLMs have been widely hyped as a transformative tool for healthcare,¹⁷ actual applications of LLMs in real-world clinical settings remain limited. Our work provides a practical application for LLMs in healthcare that can be combined with existing informatics methods to facilitate novel research in impactful domains.

Our work has several limitations. First, LLMs are black-box models and may fail in unpredictable ways. For example, the LLM labeled the variable ALT as ALY in a handful of extractions. While many such errors can be corrected in post-processing, they are still worth highlighting as potential failure modes. Second, elements of our hallucination heuristics are task-specific (e.g., the range of creatinine values we consider to be plausible). However, the principles underlying the heuristics (e.g., censoring implausible outputs) are generalizable and can be applied to any information extraction task (clinical or otherwise). Third, our work focused on extracting eight numeric variables with one LLM from a single dataset. Generalizing our approach to more clinical variables and LLMs with more parameters and different training procedures is an important direction for future work. Assessing the performance of LLMs trained on clinical data–such as Clinical Camel¹⁸ and MedAlpaca¹⁹–is an interesting next step. Similarly, broadly testing our method on datasets in other clinical domains is a key future direction.

Finally, we discuss the challenges of implementing our method at scale. LLMs are computationally expensive to deploy. We minimized the computational cost of our method by using a relatively small LLM loaded in half-precision, ensuring that it is applicable in many clinical research settings. However, running our method on thousands of notes in a reasonable time frame (i.e., hours, not days) still requires access to a GPU, which may not be universally available. A potential solution is a hybrid Regex-LLM approach; for example, only using an LLM for those notes where regex does not extract a (reasonable) value. This approach could drastically lower compute cost compared to the LLM-only approach and increase recall compared to the regex-only approach. Designing a hybrid solution that is effective in practice is an important direction for future work.
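
One possible shape for such a hybrid is sketched below; it is an illustration of the idea rather than a tested design. The three callables (`regex_extract`, `llm_extract`, `is_plausible`) are hypothetical placeholders for a rules-based extractor, an LLM-based extractor, and a plausibility check such as a range filter.

```python
def hybrid_extract(note, variables, regex_extract, llm_extract, is_plausible):
    """Use the cheap regex result, calling the LLM only when a value is missing or implausible."""
    result = dict(regex_extract(note))
    needs_llm = [
        name for name in variables
        if result.get(name) is None or not is_plausible(name, result[name])
    ]
    if needs_llm:
        llm_result = llm_extract(note)  # the expensive call happens at most once per note
        for name in needs_llm:
            result[name] = llm_result.get(name)
    return result


# Stubs for illustration only.
llm_calls = []

def fake_llm(note):
    llm_calls.append(note)
    return {"BUN": 10, "Creatinine": 1.5}

complete = hybrid_extract(
    "BUN 10 crea 1.1", ["BUN", "Creatinine"],
    regex_extract=lambda note: {"BUN": 10, "Creatinine": 1.1},
    llm_extract=fake_llm,
    is_plausible=lambda name, value: value > 0,
)
filled = hybrid_extract(
    "BUN/Cr 10/1.5", ["BUN", "Creatinine"],
    regex_extract=lambda note: {"BUN": None, "Creatinine": 1.5},
    llm_extract=fake_llm,
    is_plausible=lambda name, value: value > 0,
)
```

In the first call the regex result is complete and the LLM is never invoked; in the second, only the missing BUN is filled from the LLM output, so compute is spent only on notes the rules cannot handle.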

Acknowledgements

This research was supported, in part, by Takeda Development Center Americas, Inc. (successor in interest to Millennium Pharmaceuticals, Inc.) and, in part, by the MIT Racism Research Fund.

Footnotes

Based on U.S. Organ Procurement and Transplantation Network data as of February 10, 2024

Author Contributions

Hammaad Adam contributed to the conceptualization, methodology, formal analysis, investigation, and visualization of this study and the writing of this manuscript. All other co-authors (Junjing Lin, Jianchang Lin, Hillary Keenan, Ashia Wilson, and Marzyeh Ghassemi) contributed to research supervision and funding acquisition.

Student Information

Hammaad Adam is a PhD candidate at the Institute for Data, Systems, and Society (IDSS) at the Massachusetts Institute of Technology (MIT), located at 50 Ames St, Cambridge, MA 02142, USA. His primary advisor is Dr. Marzyeh Ghassemi, an Associate Professor at MIT.

References

1. Poulos J, Zhu L, Shah AD. Data gaps in electronic health record (EHR) systems: An audit of problem list completeness during the COVID-19 pandemic. Int J Med Inform. 2021 Jun;150:104452. doi: 10.1016/j.ijmedinf.2021.104452.
2. Liu S, Wang L, Ihrke D, Chaudhary V, Tao C, Weng C, et al. Correlating Lab Test Results in Clinical Notes with Structured Lab Data: A Case Study in HbA1c and Glucose. AMIA Jt Summits Transl Sci Proc. 2017 Jul 26;2017:221–8.
3. Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information extraction applications: A literature review. J Biomed Inform. 2018 Jan;77:34–49. doi: 10.1016/j.jbi.2017.11.011.
4. Adamson B, Waskom M, Blarre A, Kelly J, Krismer K, Nemeth S, et al. Approach to machine learning for extraction of real-world data variables from electronic health records. Front Pharmacol. 2023 Sep 15;14:1180962. doi: 10.3389/fphar.2023.1180962.
5. McInerney D, Young G, van de Meent JW, Wallace B. CHiLL: Zero-shot Custom Interpretable Feature Extraction from Clinical Notes with Large Language Models. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023. pp. 8477–94.
6. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Goldberg Y, Kozareva Z, Zhang Y, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022. pp. 1998–2022.
7. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022 Dec 6;35:27730–44.
8. Adam H, Suriyakumar V, Pollard T, Moody B, Erickson J, Segal G, et al. Organ retrieval and collection of health information for donation (ORCHID) [Internet] PhysioNet. 2023. Available from: https://physionet.org/content/orchid/1.0.0/
9. Fernandez M. Organ procurement organization data to be analyzed for first time [Internet] Axios. 2021. [cited 2024 Feb 9]. Available from: https://www.axios.com/2021/10/06/organ-procurement-data-to-be-analyzed-fas.
10. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models [Internet] arXiv [cs.CL] 2023. Available from: http://arxiv.org/abs/2307.09288.
11. Hosten AO. Butterworths: 1990. BUN and Creatinine.
12. Hall P, Cash J. What is the real function of the liver “function” tests? Ulster Med J. 2012 Jan;81(1):30–6.
13. McKeown DW, Bonser RS, Kellum JA. Management of the heartbeating brain-dead organ donor. Br J Anaesth. 2012 Jan;108(Suppl 1):i96–107. doi: 10.1093/bja/aer351.
14. Dernoncourt F, Lee JY, Szolovits P. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. In: Specia L, Post M, Paul M, editors. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Copenhagen, Denmark: Association for Computational Linguistics; 2017. pp. 97–102.
15. Dong Q, Li L, Dai D, Zheng C, Wu Z, Chang B, et al. A Survey on In-context Learning [Internet] arXiv [cs.CL] 2022. Available from: http://arxiv.org/abs/2301.00234.
16. Zhang Y, Li Y, Cui L, Cai D, Liu L, Fu T, et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models [Internet] arXiv [cs.CL] 2023. Available from: http://arxiv.org/abs/2309.01219.
17. Webster P. Six ways large language models are changing healthcare. Nat Med. 2023 Dec;29(12):2969–71. doi: 10.1038/s41591-023-02700-1.
18. Toma A, Lawler PR, Ba J, Krishnan RG, Rubin BB, Wang B. Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [Internet] arXiv [cs.CL] 2023. Available from: http://arxiv.org/abs/2305.12031.
19. Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Löser A, et al. MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data [Internet] arXiv [cs.CL] 2023. Available from: http://arxiv.org/abs/2304.08247.
