BMJ Evidence-Based Medicine. 2024 Dec 20;30(6):385–389. doi: 10.1136/bmjebm-2024-113199

From promise to practice: challenges and pitfalls in the evaluation of large language models for data extraction in evidence synthesis

Gerald Gartlehner 1,2, Leila Kahwati 1, Barbara Nussbaumer-Streit 2, Karen Crotty 1, Rainer Hilscher 1, Shannon Kugley 1, Meera Viswanathan 1, Ian Thomas 1, Amanda Konet 1, Graham Booth 1, Robert Chew 1

Background

The introduction of generative large language models (LLMs) has led to the exploration of their use in evidence synthesis tasks, such as literature searches,1 screening,2 3 risk of bias assessment,4 5 data extraction,6–9 statistical analysis8 and writing plain language summaries.10 Of these tasks, data extraction (ie, transferring data from reports of primary studies into standardised tables) is a crucial, yet time-consuming11 and error-prone12 step in the evidence synthesis process.

Unlike earlier natural language processing technologies used for data extraction, LLMs do not require labelled training data, making them accessible to users without a technical background. To date, few studies have assessed the use of LLMs for data extraction, and these studies have yielded mixed results compared with human reference standards.1 6 7 13 14

In this commentary, we discuss the challenges and pitfalls researchers face when assessing the performance of an LLM for data extraction in validation studies. Our insights stem from an initial proof-of-concept study,7 and an ongoing workflow validation study focused on employing an LLM for semi-automated data extraction.15 These challenges include selecting an appropriate reference standard, avoiding data contamination, choosing suitable outcome metrics and defining what qualifies as an error. We will also reflect on prompt engineering and the practical challenges associated with using LLMs in validation research.

The choice of validation design determines the relevance of results

Evaluating LLMs for data extraction in evidence synthesis involves two distinct approaches that investigators must consider at the outset of planning a validation study: model validation or workflow validation in real-world review settings.

Model validation studies examine the performance and reliability of an LLM under specific, controlled conditions. These studies are essential for evaluating the model’s ability to extract different types of data and identifying its strengths and limitations. They determine whether data extraction using an LLM is feasible and typically rely on carefully controlled datasets to compare human versus machine performance. If the selection of example texts is representative, model validation studies can provide generalisable results.16

In contrast, workflow validation studies integrate the LLM into the workflow of an ongoing systematic review, providing a detailed perspective on the model’s practical effectiveness, efficiency and utility. These studies can be designed as a study within a review (SWAR),17 allowing researchers to observe and document the model’s performance in real-world scenarios, extending beyond theoretical capabilities or controlled environments.

Model validation studies are less expensive and serve as necessary prerequisites for workflow validation studies. While workflow validation studies are more complex, they provide richer insights and ultimately determine the utility of an LLM. For instance, an LLM might match human accuracy in data extraction tasks yet prove unsuitable for practical application due to the time required to successfully integrate it into the workflow. A limitation of workflow validation studies is their potentially restricted generalisability. This limitation arises from the unique characteristics and potential biases present in the specific study publications used for a given review. The performance and correctness of an LLM in extracting data from one set of documents may not necessarily translate to equal performance across different domains or study designs.

The choice of the reference standard may lead to benchmark bias or data contamination

Human-led data extractions usually serve as the reference standard to evaluate an LLM’s performance in data extraction. However, two potential biases—benchmark bias and data contamination—need to be considered.

Benchmark bias, a form of information or classification bias,18 arises when the reference used in a validation study is imperfect. In the case of human-led data extraction, research indicates that up to 63% of studies extracted by humans contain at least one data extraction error.12 In our recent proof-of-concept study, Claude 2 made one major and five minor errors (see definitions in table 1) but detected 21 minor errors in the human-led reference standard.7 Thus, considering human data extraction from published systematic reviews as the gold standard introduces bias by presuming discrepancies are errors made by the LLM, potentially underestimating the model’s capabilities. Therefore, refining the reference standard in validation studies of data extraction is essential.

Table 1.

Classification of differences when assessing LLM-assisted data extraction processes

Type of difference: definition
Missed or omitted data: Data that were available in the study report but were either missed or omitted by the extraction process.
Fabricated data: Data that were not available in the study report were either inaccurately filled in by human data extractors or erroneously generated (hallucinated) by the LLM and not revised/removed by the human conducting the correctness check.
Misallocated data: Data that were provided in the study report were allocated to the wrong data element field.
Incorrect calculations: Mathematically incorrect calculations of data elements based on information provided in the study report, including rounding errors.
Difference in level of detail: The level of detail is the only difference and is not relevant to drawing conclusions.
Other: Optional ‘other’ field if none of the above apply.

Potential impact of difference
Major error: This error significantly compromises the correctness of the data and, if uncorrected, could lead to erroneous conclusions; for example, grossly incorrect calculations, misallocated data that result in a different interpretation or uncorrected hallucinations of the LLM that result in a new or different interpretation of the data.
Minor error: This error is less severe than a major error and may or may not impact interpretation of the existing data; for example, small calculation errors or rounding errors that do not critically affect the data’s overall utility.
Inconsequential difference: This difference would not impact the interpretation of the data; for example, additional or alternative language describing the study population or the intervention that does not inherently alter the meaning.

LLM, large language model.

In validation studies that do not use a labelled corpus (ie, a dataset where each data item has been annotated with specific labels, serving as a reference to assess the accuracy), one effective method is verifying each discrepancy between human and LLM extractions against the original study report, ideally by a blinded adjudicator. This approach can expose human errors, allowing for corrections in the reference standard and providing a more accurate assessment of LLM performance.
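As a minimal illustration of this adjudication step, the Python sketch below flags every data element where the human and LLM extractions differ so each discrepancy can be checked against the original study report; the field names and the deliberately simple normalisation rule are illustrative assumptions, not part of any published tooling.

```python
def normalise(value: str) -> str:
    """Light normalisation so trivial formatting differences are not flagged."""
    return " ".join(value.strip().lower().split())

def flag_discrepancies(human: dict[str, str], llm: dict[str, str]) -> list[dict]:
    """Return one record per data element where the two extractions differ."""
    discrepancies = []
    for field in sorted(human.keys() | llm.keys()):
        h, l = human.get(field, ""), llm.get(field, "")
        if normalise(h) != normalise(l):
            discrepancies.append({"field": field, "human": h, "llm": l, "adjudicated_value": None})
    return discrepancies

# Each flagged discrepancy is then checked against the original study report by an
# adjudicator blinded to which value came from the human and which from the LLM.
queue = flag_discrepancies(
    {"sample_size": "220", "country": "Austria"},
    {"sample_size": "224", "country": "Austria"},
)
print(queue)  # only sample_size is queued for adjudication
```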

Data contamination occurs if the dataset used for performance evaluation also contributed to training the LLM.19 For example, using data from open-access systematic reviews to assess an LLM’s data extraction capability risks contamination if the model previously encountered and ‘memorised’ this data during training. This could artificially enhance its performance on familiar material. The practical challenge for investigators is that developers of LLMs typically do not disclose their training data. However, it is conceivable that databases with numerous open-access systematic reviews, such as the Cochrane Library, may be used to train LLMs. Similarly, publicly available labelled corpora of study data, such as Evidence-based Medicine-Natural Language Processing,20 could be used, either intentionally or unintentionally, for LLM training.

Although the extent to which data contamination biases LLM evaluations is uncertain, investigators can implement several strategies to minimise this risk. These strategies include using datasets from unpublished reviews, subscription-based systematic reviews or studies published after the LLM’s most recent training update. For instance, as of this writing, OpenAI’s GPT-4o has been trained on data up to October 2023, and Claude 3 up to August 2023. Datasets from open-access reviews published after an LLM’s training date should not be subject to data contamination. However, these strategies may, in turn, limit the generalisability of the validation study’s results. Additionally, the risk of data contamination is likely minimal for workflow validation studies, which typically involve a prospective comparison of extracted data, reducing the likelihood of prior model exposure to the data.
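The date-based strategy described above can be operationalised with a simple filter, as in the sketch below. It is illustrative only: the cutoff dates mirror the figures cited in the text, and the record structure is an assumption.

```python
# Contamination-avoidance sketch: keep only study reports published after the
# model's stated training cutoff. Dates and record fields are illustrative.
from datetime import date

TRAINING_CUTOFFS = {
    "gpt-4o": date(2023, 10, 31),   # per the text, trained on data up to October 2023
    "claude-3": date(2023, 8, 31),  # per the text, trained on data up to August 2023
}

def low_contamination_risk(records: list[dict], model: str) -> list[dict]:
    """Return study reports published after the model's training cutoff."""
    cutoff = TRAINING_CUTOFFS[model]
    return [r for r in records if r["publication_date"] > cutoff]

corpus = [
    {"id": "trial-a", "publication_date": date(2024, 3, 1)},
    {"id": "trial-b", "publication_date": date(2022, 6, 15)},
]
eligible = low_contamination_risk(corpus, "gpt-4o")  # keeps only trial-a
```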

Challenges in defining incorrect data extractions and adjudicating differences between extracted data elements

Even with a reference standard corrected for human errors, adjudicating differences between a fully human extraction process and one assisted by an LLM remains challenging. The definition of ‘correct’ can vary even among human extractors. Data extraction often involves decisions about format, style, relevance, the level of detail and reasonable inferences from ambiguous data.

Validation studies typically focus on the correctness of extracting individual data elements, which is easiest to assess for data elements that are narrowly defined. Extractions of data elements that are defined broadly (eg, ‘study population’) may vary between humans and LLMs, but these are usually not factual errors; instead, they are justifiable interpretations of the contents of the original study report. A similar variability in data extractions can also be seen when comparing human-to-human extractions. One way to approach this issue is to conceptualise differences in data extractions as concordant and discordant rather than simply correct and incorrect. Our team has employed the following working definition for concordance: concordance is factual congruence of extracted data items, even if there are variations in style, presentation or length between the two data extractions.

Another challenge is that data extractors are often required to classify, summarise or record data in ways that involve judgement. For instance, identifying the active ingredients of a behavioural intervention or classifying a specific outcome into a domain (eg, quality of life, pain, function). In such cases, the ‘correct’ answer is often determined by the judgement of the most senior reviewer, not the study report. Additionally, extractors sometimes calculate values based on data reported in study reports, such as determining the percentage of women in the study population from the reported frequency of women and the overall sample size. These calculations occasionally require judgement, such as deciding which number to use as a denominator for the overall study population (eg, number randomised or number analysed).

Our approach to assessing correctness of data extraction has evolved from merely counting the ‘errors’ made by an LLM-assisted process to evaluating the factual concordance of the data.

When extracted data do not factually agree, we consult the original study report to adjudicate the difference. We also classify the type and severity of the difference to enhance our prompt engineering and provide better guidance to human and LLM extractors for future extractions. Our current classification scheme is depicted in table 1.
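The classification in table 1 lends itself to a small structured record for each adjudicated difference. The sketch below is one possible encoding; the class and field names are our illustrative choices rather than part of a published tool.

```python
# Illustrative encoding of the table 1 classification scheme for adjudicated differences.
from dataclasses import dataclass
from enum import Enum

class DifferenceType(Enum):
    MISSED_OR_OMITTED = "missed or omitted data"
    FABRICATED = "fabricated data"
    MISALLOCATED = "misallocated data"
    INCORRECT_CALCULATION = "incorrect calculation"
    LEVEL_OF_DETAIL = "difference in level of detail"
    OTHER = "other"

class Impact(Enum):
    MAJOR_ERROR = "major error"
    MINOR_ERROR = "minor error"
    INCONSEQUENTIAL = "inconsequential difference"

@dataclass
class AdjudicatedDifference:
    study_id: str
    data_element: str
    difference_type: DifferenceType
    impact: Impact
    note: str = ""

record = AdjudicatedDifference(
    study_id="trial-a",
    data_element="risk ratio (95% CI)",
    difference_type=DifferenceType.INCORRECT_CALCULATION,
    impact=Impact.MINOR_ERROR,
    note="rounding difference in the upper CI bound",
)
```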

Selecting appropriate outcomes and defining the unit of analysis

The choice of validation design—whether model validation or workflow validation—determines the types of outcomes that can be assessed. Model validation studies primarily focus on correctness of data extraction. In contrast, the prospective design of workflow validation studies allows for the evaluation of additional relevant outcomes which determine the utility of a data extraction tool, such as task time or the downstream effects of data extraction errors on conclusions.

Accuracy and unit of analysis

Various metrics are available to quantify correctness of an LLM’s data extraction. Table 2 summarises the advantages and disadvantages of four commonly used metrics for assessing machine learning performance: precision, recall, accuracy and F1 score. Because the performance of LLMs can vary substantially across data types, an overall quantitative evaluation is usually less meaningful than an evaluation by data category (eg, participant characteristics, outcomes) or individual items.9

Table 2.

Commonly used metrics for quantifying LLM accuracy when used for data extraction

Precision (= positive predictive value)
Definition: The accuracy of an LLM on the data items for which it returned an extracted value: TP / (TP + FP)
Range of results: 0–1 (where 0 is poor and 1 is excellent); results can also be expressed as a percentage
Strength: Widely used.
Limitations: Can be misleading if considered independently. Does not take missed data into consideration and can be high even if the LLM missed information on a substantial number of data elements (ie, it has low recall).

Recall (= sensitivity)
Definition: The ability of an LLM to correctly extract available data items: TP / (TP + FN)
Range of results: 0–1 (where 0 is poor and 1 is excellent); results can also be expressed as a percentage
Strength: Widely used.
Limitations: Can be misleading if considered independently. Does not reflect the extent of fabricated data for items for which no data were available (ie, hallucinated data).

Accuracy (= percent agreement)
Definition: The percentage of correct data extractions out of all data elements: (TP + TN) / (TP + FP + TN + FN) × 100
Range of results: 0–100 (where 0 is poor and 100 is excellent); results can also be expressed as a decimal
Strength: Straightforward summary of an LLM’s performance in correctly extracting data and correctly identifying elements for which no data are reported.
Limitations: Fails to distinguish between fabricated data and missed data. May not be a reliable metric when dealing with imbalanced data distributions.

F1 score
Definition: An evaluation metric that combines precision and recall (via their harmonic mean) into a single statistic: 2 × (Precision × Recall) / (Precision + Recall)
Range of results: 0–1 (where 0 is poor and 1 is excellent); the greater the disparity between precision and recall, the lower the F1 score
Strengths: Ensures that both fabricated data (false positives) and missed data (false negatives) are considered separately. More sensitive to imbalanced data than accuracy.
Limitation: Not informative about the distribution of errors, as it provides a single value that summarises the model’s performance across both precision and recall.

FN, false negatives: the number of data items missed or incorrectly extracted by the LLM from the full text publication; FP, false positives: the number of data items for which the LLM provided fabricated data when no data were available in the full text publication (ie, hallucinated data); LLM, large language model; TN, true negatives: the number of data items that the LLM correctly identified as not available in the full text publication; TP, true positives: the number of data items correctly extracted by the LLM from the full text publication.
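As a minimal illustration, the metrics in table 2 can be computed directly from adjudicated counts of true positives, false positives, true negatives and false negatives; the counts in the sketch below are invented.

```python
# Sketch of the table 2 metrics computed from adjudicated counts (invented values).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return 100 * (tp + tn) / (tp + fp + tn + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example counts for one data category (eg, participant characteristics)
tp, fp, tn, fn = 180, 5, 10, 15
print(f"precision={precision(tp, fp):.2f}, recall={recall(tp, fn):.2f}, "
      f"accuracy={accuracy(tp, tn, fp, fn):.1f}%, F1={f1(tp, fp, fn):.2f}")
```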

Closely related is the choice of the unit of analysis. Data items extracted for evidence synthesis vary in complexity, ranging from simple single numbers (eg, trial registration number) to more complex elements that consist of multiple values (eg, effect estimates with 95% CIs). Consequently, composite data elements are more susceptible to extraction errors than single-value data elements. For instance, consider a risk ratio with a 95% CI. If the point estimate and the upper and lower bounds of the CI are incorrect, these would constitute three separate errors and potentially distort the results. However, if the unit of analysis is defined as the composite data item (ie, the point estimate with the 95% CI), these multiple incorrect values would be counted as a single error. Therefore, defining data items and the appropriate unit of analysis in the study protocol is important to ensure consistent error counts across data elements.
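The risk ratio example can be made concrete with a short sketch showing how the chosen unit of analysis changes the error count; the values below are invented.

```python
# Element-level versus composite unit of analysis for a risk ratio with 95% CI.
reference = {"rr": 0.80, "ci_lower": 0.65, "ci_upper": 0.98}
extracted = {"rr": 0.85, "ci_lower": 0.70, "ci_upper": 1.02}

# Element-level unit of analysis: each wrong value counts separately
element_errors = sum(reference[k] != extracted[k] for k in reference)  # 3 errors

# Composite unit of analysis: the whole estimate counts as one error
composite_errors = int(reference != extracted)  # 1 error
```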

Inter-rater reliability

Kappa-type statistics (eg, Cohen’s kappa or Gwet’s Agreement Coefficient 1 (AC1)) are useful metrics for assessing inter-rater agreement. These statistics assume that some proportion of the observed agreement between raters is due to chance rather than true agreement. In validation studies, chance agreement can be problematic when extractors choose between predefined categories, such as distinguishing between randomised and non-randomised studies. Although kappa-type statistics can be applied to qualitative data that vary in level of detail, such as descriptions of study populations, they often require simplification of complex information into categories. This process and the resulting statistics often fail to capture the nuances that are crucial to interpreting qualitative data. Additionally, chance agreement is not a concern when extracting numerical data from study reports, because the probability of randomly guessing the exact numerical value that matches the true data is extremely low.
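For categorical data elements, a chance-corrected agreement coefficient can be computed with standard libraries. The sketch below uses scikit-learn's implementation of Cohen's kappa with invented labels; Gwet's AC1 is not included in scikit-learn and would require a dedicated agreement package.

```python
# Chance-corrected agreement for a categorical data element (eg, study design).
from sklearn.metrics import cohen_kappa_score

human = ["RCT", "RCT", "non-RCT", "RCT", "non-RCT", "RCT"]
llm   = ["RCT", "RCT", "RCT",     "RCT", "non-RCT", "RCT"]

kappa = cohen_kappa_score(human, llm)
print(f"Cohen's kappa = {kappa:.2f}")
```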

Time spent on task

A primary objective of employing LLMs for data extraction is to enhance the efficiency of the systematic review process. Time spent on task is a relevant outcome for evaluating this efficiency. Although LLMs demonstrate rapid data extraction capabilities from study reports, additional time is needed for prompt engineering—a requirement absent in human-only data extraction approaches. To accurately assess whether integrating LLMs enhances efficiency, it is essential to measure the time invested in all associated tasks. Some tasks, such as prompt engineering, are time-intensive only once and may become less demanding as prompt libraries develop.

Impact on conclusions: determining downstream effects of data extraction inaccuracies

The consequences of inaccuracies in different data elements can vary, potentially impacting a review’s conclusions differently. For instance, the omission of data from a critical outcome could significantly sway the overall conclusion. Conversely, the absence of data from a single, inconsequential study might have a negligible impact on the overarching results and conclusions. Therefore, in-depth case studies comparing different ways of incorporating LLMs with human-only methodologies should extend beyond evaluating the correctness of data extraction alone. They should delve into how any inaccuracies might affect the synthesis (eg, meta-analyses) and the ultimate conclusions of a systematic review.

The prompts determine the output of the LLM

Prompt engineering involves designing text inputs (prompts) with the objective of accurate and succinct output from LLMs. The prompt provided to an LLM directly impacts response correctness, completeness and format. For data extraction, if the prompt does not contain sufficient information and context to complete a task, the LLM is more likely to generate incomplete, erroneous or hallucinated responses. For example, prompts for LLM data extraction tend to include instructions, field definitions and the document text as part of the prompt. Models with shorter context lengths may not be able to fit the entire document in the prompt, requiring additional text parsing that can reduce performance.13
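The sketch below illustrates how these components (instructions, field definitions and the document text) might be assembled into a single prompt. The wording and field definitions are our illustrative assumptions, not the prompts used in any particular study.

```python
# Illustrative assembly of a data extraction prompt; fields and wording are assumptions.
FIELD_DEFINITIONS = {
    "sample_size": "Total number of participants randomised.",
    "population": "Key eligibility criteria and baseline characteristics.",
    "primary_outcome": "The primary outcome as defined by the study authors.",
}

def build_extraction_prompt(document_text: str) -> str:
    field_block = "\n".join(f"- {name}: {definition}" for name, definition in FIELD_DEFINITIONS.items())
    return (
        "You are assisting with data extraction for a systematic review.\n"
        "Extract the following data elements from the study report below.\n"
        "If a data element is not reported, answer 'not reported'; do not guess.\n\n"
        f"Data elements:\n{field_block}\n\n"
        f"Study report:\n{document_text}\n\n"
        "Return the answers as JSON keyed by data element name."
    )

# The resulting string is sent to whichever LLM and API the review team uses; models
# with short context windows may require splitting document_text first.
```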

When developing prompts for LLM-assisted data extraction, iteration on the prompt text is necessary for obtaining accurate results.9 Before extracting information, researchers should conduct a pilot phase in which prompts are developed, tested and evaluated on a subset of articles. This also helps avoid the temptation to develop unique prompts for each article, which may not generalise well to new articles.

Although the pilot phase can be done informally, setting up an evaluation framework will help researchers quantitatively determine whether changes to prompts are improving outcomes or not. At a minimum, the evaluation criteria used in the pilot phase should mimic the criteria intended for the full study to provide the research team with the feedback necessary to iterate on prompts. Other criteria that may be useful to consider testing for in the pilot phase include response format, length and level of detail.
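One minimal way to set up such an evaluation framework is a scoring loop over a labelled pilot subset, as sketched below; extract_with_llm is a placeholder for whichever extraction call the team uses, and exact string matching stands in for whatever concordance criteria the full study will apply.

```python
# Pilot-phase evaluation sketch: score each prompt version against a labelled subset.
def evaluate_prompt(prompt_version: str, pilot_set: list[dict], extract_with_llm) -> float:
    """Return the proportion of concordant data elements for one prompt version."""
    concordant = total = 0
    for article in pilot_set:
        extracted = extract_with_llm(prompt_version, article["full_text"])
        for field, reference_value in article["labels"].items():
            total += 1
            concordant += int(extracted.get(field) == reference_value)
    return concordant / total

# Comparing scores across prompt versions shows whether a revision actually helps,
# rather than relying on impressions from individual articles.
```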

Test-retest reliability

Test-retest reliability is a crucial component of model validation studies for LLMs, as it measures output consistency over time. To ensure accurate test-retest assessments in data extraction, it is vital to use the same prompts and study reports for the data extraction process.

LLMs inherently incorporate stochastic elements, enabling them to produce varied and creative responses. They predict the likelihood of word sequences using the context from preceding words. During this process, LLMs frequently employ sampling methods to select subsequent words in a sequence. Consequently, even with identical prompts, the model can produce different outputs on different runs.9

For example, in our proof-of-concept study testing an LLM’s performance in data extraction, the number of errors was similar between the original run and a repeated run 4 weeks later with the same prompts and study reports. However, the errors occurred in different data elements in five out of six cases.7
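A test-retest check of this kind can be as simple as comparing two runs element by element, as in the sketch below with invented values.

```python
# Test-retest sketch: the same prompt and study report run twice, compared per element.
run1 = {"sample_size": "220", "mean_age": "54.1", "primary_outcome": "HbA1c at 12 months"}
run2 = {"sample_size": "220", "mean_age": "54.3", "primary_outcome": "HbA1c at 12 months"}

disagreements = [f for f in run1 if run1[f] != run2[f]]
agreement_rate = 1 - len(disagreements) / len(run1)
print(f"test-retest agreement: {agreement_rate:.0%}; differing elements: {disagreements}")
# Setting the sampling temperature to 0 (where the API allows it) reduces, but does
# not always eliminate, this run-to-run variability.
```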

Practical challenges

Validation studies for data extraction also face various practical obstacles. The first is the rapid pace of LLM development: by the time a study is completed, the LLM under evaluation may have been replaced by a newer model. To address this, investigators can design workflow validation as an adaptive SWAR, allowing the option to switch to a new model if it offers advantages such as a larger context window. Second, LLM rate restrictions, which vary based on factors like institutional traffic, can severely slow the workflow. Subscription-based versions of the LLM or application programming interface (API) access usually help mitigate these restrictions. Third, human variation (differences in systematic review experience, data extraction detail, team approaches to validating data extractions and proficiency in engineering prompts) can significantly affect validation study results. Variation in human expertise and team differences should be carefully considered when interpreting workflow validation study results; large SWARs with multiple review teams can provide insights into the variability of the human reference standard. Furthermore, the choice of topic for validation can affect results: randomised trials of simple pharmacological interventions may be easier for both humans and machines to extract accurately than studies of behavioural or complex implementation interventions, or non-randomised designs. The quality of reporting in study publications can also significantly affect the accuracy of data extraction by both humans and machines. It is therefore important to explore variations in correctness and utility across the spectrum of evidence synthesis topics and study designs.
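For the rate restrictions mentioned above, a common mitigation is to retry failed calls with exponentially increasing waits. The sketch below is generic: call_llm and RateLimitError are placeholders for whichever client library is used.

```python
# Generic exponential-backoff sketch for rate-limited LLM calls.
import time

class RateLimitError(Exception):
    """Placeholder for the rate-limit exception raised by the chosen client library."""

def call_with_backoff(call_llm, prompt: str, max_retries: int = 5):
    """Retry a rate-limited call, doubling the wait after each failure."""
    delay = 1.0
    for _ in range(max_retries):
        try:
            return call_llm(prompt)
        except RateLimitError:
            time.sleep(delay)  # wait before retrying
            delay *= 2
    raise RuntimeError("rate limit not cleared after retries")
```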

Footnotes

Contributors: GG, KC, MV and LK contributed to conceptualisation. GG and KC contributed to funding acquisition. GB contributed to project administration. GG, LK, BN-S, RC and AK contributed to writing the original draft. KC, MV, LK, RH, BN-S, SK, IT, AK and GB contributed to review and revision of the draft. We used AI for editing purposes.

Funding: The underlying research for this commentary was supported by internal resources of RTI International (https://www.rti.org) through the Innovation Fund, and Cochrane Austria (BNS). Effort from LK and MV was supported by the RTI Fellows Program.

Competing interests: The authors declare no competing interests. They emphasise that they have no financial investments in companies developing LLMs or commercial software using LLMs, nor do they collaborate with LLM providers.

Provenance and peer review: Not commissioned; externally peer reviewed.

Ethics statements

Patient consent for publication

Not applicable.

Ethics approval

Not applicable.

References

1. Qureshi R, Shaughnessy D, Gill KAR, et al. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst Rev 2023;12:72. doi: 10.1186/s13643-023-02243-z
2. Tran V-T, Gartlehner G, Yaacoub S, et al. Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses. Ann Intern Med 2024;177:791–9. doi: 10.7326/M23-3389
3. Cai X, Geng Y, Du Y, et al. Utilizing ChatGPT to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation. medRxiv [Preprint] 2023. doi: 10.1101/2023.09.06.23295072
4. Hasan B, Saadi S, Rajjoub NS, et al. Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment. BMJ EBM 2024;29:394–8. doi: 10.1136/bmjebm-2023-112597
5. Lai H, Ge L, Sun M, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open 2024;7:e2412687. doi: 10.1001/jamanetworkopen.2024.12687
6. Dagdelen J, Dunn A, Lee S, et al. Structured information extraction from scientific text with large language models. Nat Commun 2024;15:1418. doi: 10.1038/s41467-024-45563-x
7. Gartlehner G, Kahwati L, Hilscher R, et al. Data extraction for evidence synthesis using a large language model: a proof-of-concept study. Res Synth Methods 2024;15:576–89. doi: 10.1002/jrsm.1710
8. Reason T, Benbow E, Langham J, et al. Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models. Pharmacoecon Open 2024;8:205–20. doi: 10.1007/s41669-024-00476-9
9. Graziozi S, Campbell F, Kapp C, et al. Exploring the use of a large language model for data extraction in systematic reviews. 2024. Available: https://arxiv.org/abs/2405.14445 [Accessed 12 Sep 2024].
10. Ovelman C, Kugley S, Gartlehner G, et al. The use of a large language model to create plain language summaries of evidence reviews in healthcare: a feasibility study. Cochrane Ev Synth Methods 2024;2:e12041. doi: 10.1002/cesm.12041
11. Nussbaumer-Streit B, Ellen M, Klerings I, et al. Resource use during systematic review production varies widely: a scoping review. J Clin Epidemiol 2021;139:287–96. doi: 10.1016/j.jclinepi.2021.05.019
12. Mathes T, Klaßen P, Pieper D. Frequency of data extraction errors and methods to increase data extraction quality: a methodological review. BMC Med Res Methodol 2017;17:152. doi: 10.1186/s12874-017-0431-4
13. Konet A, Thomas I, Gartlehner G, et al. Performance of two large language models for data extraction in evidence synthesis. Res Synth Methods 2024;15:818–24. doi: 10.1002/jrsm.1732
14. Mahmoudi H, Chang D, Lee H, et al. A Critical Assessment of Large Language Models for Systematic Reviews: Utilizing ChatGPT for Complex Data Extraction. SSRN J 2024. doi: 10.2139/ssrn.4797024
15. Gartlehner G. SWAR 28: Semi-automated data extraction for evidence syntheses using Claude 2. 2024.
16. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd edn. New York, NY: Springer, 2009.
17. Devane D, Burke NN, Treweek S, et al. Study within a review (SWAR). J Evid Based Med 2022;15:328–32. doi: 10.1111/jebm.12505
18. Schmidt RL, Factor RE. Understanding sources of bias in diagnostic accuracy studies. Arch Pathol Lab Med 2013;137:558–65. doi: 10.5858/arpa.2012-0198-RA
19. Carlini N, Ippolito D, Jagielski M, et al. Quantifying memorization across neural language models. 2022.
20. Nye B, Li JJ, Patel R, et al. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1); 2018; Melbourne, Australia. doi: 10.18653/v1/P18-1019

