JAMIA Open
. 2025 Nov 22;8(6):ooaf121. doi: 10.1093/jamiaopen/ooaf121

Using large language models for temporal relation extraction from pediatric clinical reports

Judith Jeyafreeda Andrew, Juliette Potier, Nicolas Garcelon, Anita Burgun, Marc Vincent
PMCID: PMC12640238  PMID: 41281245

Abstract

Objectives

To evaluate large language models (LLMs) for extracting temporal relations from pediatric rare disease clinical reports to enable automated patient timeline creation.

Materials and Methods

We developed a temporal relation extraction framework for electronic health records, using 25 clinical reports from a pediatric rare disease hospital. We implemented few-shot prompting with 3 different LLMs in secure environments.

Results

Our findings reveal that binary classification significantly outperforms multi-class approaches for temporal relation extraction, with best F1 scores reaching 0.70 for simpler relations while more complex relations remain challenging (F1: 0.03-0.40). Mistral 22B emerged as the strongest overall performer, though model superiority varied by relation type.

Discussion

The dramatic performance improvement from reducing cognitive load (binary vs multi-class classification) demonstrates that task formulation critically impacts LLM effectiveness in specialized clinical domains. Our few-shot approach successfully enables temporal relation extraction from French pediatric texts while maintaining data privacy through local deployment, offering a viable methodology for healthcare institutions with strict data governance requirements.

Conclusion

Our few-shot prompting approach demonstrated promising results in secure environments. This methodology allows technique sharing without exposing sensitive data, advancing research possibilities for clinical natural language processing in restricted settings.

Keywords: temporal relation extraction, rare diseases, patient timeline, large language model

Introduction

A clinical report is a type of health record extract that conveys focused healthcare information about a patient and is prepared by or on behalf of a clinician. Research on automatic extraction of information from clinical reports has evolved greatly with the development of Machine Learning and Natural Language Processing (NLP) techniques. In this article, we focus on a sub-task of NLP: Temporal Relation Extraction. In particular, our end goal is to extract temporal relations from clinical reports of patients with Rare Diseases, ie, diseases that affect fewer than 65 out of 100 000 people worldwide.1 We assume that detailed histories of rare disease patients could help model the evolution of a given disease and support early diagnosis for new patients.

Although there is prior research on identifying temporal relations in clinical texts, these works focus on detecting relations between events and temporal entities with no specific downstream task in mind. In this paper, we focus on extracting all temporal relations between phenotypes and time entities with the aim of creating a patient's timeline.

Also, while clinical reports are publicly available2 for research purposes, they are mainly in the English language. The language and format of clinical reports influence the development of techniques based on large language models (LLMs) and may result in varying performance across languages. Another specific focus of this paper is therefore the French language.

Lastly, clinical reports may vary across specialties and hospitals. The clinical reports used in our experiments consist of unstructured text and contain the history of the patient, a summary of previous visits, detailed information related to the current visit, and decisions about next visits if required. (Necker Enfants Malades Hospital, AP-HP, 149 rue de Sèvres, 75015 Paris, France. The dataset is de-identified for distribution within the Institute for research purposes; however, French regulation on individual health data prevents us from making the dataset publicly available.) The content may differ in other hospitals, making validation of LLMs on local datasets essential.

Thus, in this work, we use clinical reports of patients from the Necker Hospital (AP-HP, 149 rue de Sèvres, 75015 Paris, France) for the extraction of temporal relations. Our contributions in this paper are as follows: (i) a short description of the relation types used for a manual annotation specific to the task of automatically generating a patient's timeline; (ii) a study of the performance and re-usability of 3 LLMs for temporal relation extraction in a secure environment; and (iii) an application to private hospital data, namely clinical reports specific to pediatric genetic rare diseases in the French language. The main results show that certain LLMs are better at recognizing certain relation types and that, overall, LLMs perform better with binary classification than with multi-class classification.

Related work

Annotation guidelines for relation extraction

In their study, Campillos et al3 present MERLOT, a French private clinical corpus comprising 500 documents annotated with 44 740 entities and 26 478 relations. The annotation involved 6 annotators over 24 months, achieving an average inter-annotator agreement of 0.793 F-measure for entities and 0.789 for relations. Pustejovsky et al4 present TimeML, a specification language for annotating events and temporal expressions in natural language. It introduces 4 main data structures—EVENT, TIMEX3, SIGNAL, and LINK—allowing for robust temporal anchoring and relationships among events. In the study by Andrew et al,5 the annotation process focuses on identifying temporal entities within clinical texts, which is crucial for building patient timelines, especially in rare diseases where data are limited. The guidelines categorize temporal entities into Dates—including Date of Visit, Date of Report, Date of Past Visit, Date of Future Visit, Date of Birth, Time, Frequency, Duration, and Age. Following this, we will present specific annotation guidelines for Relation Extraction (RE), crucial for building a patient's timeline. Inspired by the above works, we identified 7 relations that can occur between phenotypes and temporal entities—BEGINS-AT, ENDS-AT, CONTAINS, OVERLAP, BEFORE, BEFORE-OVERLAP, and SIMULTANEOUS—these relation types are discussed in detail in the upcoming sections.

Temporal relation extraction

Han et al6 present the OpenNRE toolkit, an open and extensible framework for neural relation extraction (RE). OpenNRE implements various neural modules and algorithms, including the BERT encoder, to support the creation of new RE models and offers typical RE models as examples for easy implementation and validation. The toolkit facilitates training custom models and quick validation, with experimental results demonstrating its effectiveness and comparability to original research implementations. Long et al7 present a model that combines rule-based methods with machine learning techniques, specifically Support Vector Machines and Recurrent Neural Networks, to identify TIMEX3 spans, EVENT attributes, and document-time relations. Lin et al8 present a BERT-based one-pass multi-task model for clinical temporal relation extraction, addressing inefficiencies in existing multi-pass methods. Raghavan et al9 introduce a methodology for ordering medical events in unstructured clinical narratives by learning to rank them based on their time of occurrence. Lin et al10 present a self-training framework that enhances Recurrent Neural Networks for temporal relation extraction in clinical texts. Lin et al11 also present a BERT-based approach for clinical temporal relation extraction, where instead of limiting analysis to individual sentences, the authors use a fixed window of contiguous tokens (50-60 tokens) to capture both within-sentence and cross-sentence temporal relations. Bramsen et al12 present a machine learning approach for temporal analysis of medical discharge summaries, focusing on temporal segmentation and ordering.

Large language models (LLMs) have been successfully used for Relation Extraction. Few-shot in-context learning entails incorporating a few training examples into model prompts, effectively “learning” via the activations induced by passing these examples through the network at inference time.13 The study by Wadhwa et al14 is one of the first attempts at using LLMs for RE. It investigates RE using GPT-3 and Flan-T5 under varying supervision levels, showing that few-shot prompting with GPT-3 achieves near state-of-the-art performance on standard RE datasets, outperforming fully supervised models. Flan-T5, while less effective in few-shot settings, achieves SOTA results when fine-tuned with chain-of-thought explanations generated by GPT-3. In this paper, we aim to use a few-shot approach to extract temporal relations.

Challenges in temporal relation extraction for rare diseases

Temporal relation extraction is especially challenging in rare disease contexts due to the heterogeneity and limited availability of annotated data. Unlike common conditions, rare diseases often involve complex, atypical clinical timelines that are inconsistently documented across reports. This irregularity makes it harder for models to learn robust patterns.

French clinical texts add further complexity. Temporal expressions can be subtle and vary across institutions—for example, phrases like “à l’âge de” or “depuis la première consultation” require nuanced understanding that differs significantly from English phrasing. Most existing NLP tools are trained on English datasets, so cross-lingual transfer is limited in effectiveness without domain-specific adaptation.

While multilingual models exist, their performance remains inconsistent without fine-tuning on French clinical data. Our approach circumvents these limitations by using few-shot prompting in French with curated examples from Necker Hospital, maintaining data privacy while ensuring linguistic and clinical relevance.

Dataset

Thakur et al15 have stressed the need for external evaluation in the setting where the LLM models are to be deployed. In this work, we focus on clinical texts in the French language. This study was performed in the context of the C’IL-LICO project, a research project on ciliopathies. The C’IL-LICO project and study protocol received approval from the French National Ethics and Scientific Committee for Research, Studies and Evaluations in the Field of Health (CESREES) under number #2201437. The data processing was approved by the French Data Protection Authority (CNIL) with a waiver of informed consent under number DR-2023-017//920398v1. Our dataset is a collection of patients’ clinical reports extracted from Dr Warehouse, the clinical data warehouse implemented at Necker Hospital.16 The features and capabilities of this data warehouse enable efficient use of NLP techniques in a secure environment. Following GDPR and CNIL guidelines, and as part of Dr Warehouse’s features, the patient data are pseudonymized prior to their use in a study. The pseudonymization process detects mentions of direct identifiers in the text (name, address, phone number, hospital identification number, with the exception of date of birth, which is relevant to the study) and replaces them with a placeholder. The detection procedure, developed prior to this work in the context of Dr Warehouse, combines matching known identifiers from patient records with machine learning models trained for Named Entity Recognition (NER) of the aforementioned identifier categories (F1 ranging from 0.95 to 1).
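The final substitution step of this pseudonymization can be sketched as follows; the function and example spans are our own illustration (in practice, the spans come from the identifier matching and NER models described above):

```python
# Minimal sketch of the placeholder-replacement step of pseudonymization.
# Entity spans (start, end, category) are assumed to come from upstream
# identifier matching and NER; dates of birth are deliberately not replaced.

def pseudonymize(text, spans):
    """Replace each detected identifier span with a category placeholder.

    spans: list of (start, end, category) tuples, eg ("NAME", "PHONE").
    Spans are applied right-to-left so that earlier offsets stay valid.
    """
    for start, end, category in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"<{category}>" + text[end:]
    return text

report = "Patient Jean Dupont, tel 0612345678, vu en consultation."
spans = [(8, 19, "NAME"), (25, 35, "PHONE")]
print(pseudonymize(report, spans))
# -> Patient <NAME>, tel <PHONE>, vu en consultation.
```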

We have chosen 25 clinical reports for our experiments, selected so that temporal entities, phenotypes, and relations are represented with enough examples to properly test the LLMs. The dataset is annotated with all time entities and phenotypes. The time entities for the dataset are Date of Birth (DOB), Date of Report (DOR), Date of Visit (DOV), Date of Past Visit (DOPV), Date of Future Visit (DOFV), Other Dates (DOTHER), Age, Duration, Frequency, and Time. The clinical reports are pre-annotated following the short annotation guideline shown in Andrew et al.5 The phenotypes are also pre-annotated using the Deep Learning technique shown in Vincent et al,17 where the authors use Gated Recurrent Units and Long Short Term Memory to identify phenotypes, leveraging fine-tuned word embeddings. Initially, embeddings are obtained using a skip-gram fastText model trained on a large collection of clinical reports. A second set of embeddings is created by fine-tuning a CamemBERT model, producing contextual embeddings.

In this paper, we present a short annotation guideline for annotating temporal relations with the goal of generating a patient timeline. To create the prompt examples for the LLMs, we manually selected examples from a separate set of 20 clinical reports annotated by 2 annotators.

The number of each temporal entity in every 1 of the 25 clinical reports used for testing the prompts is shown in a table in Appendix S1. Figure 1 shows the length of each of the 25 clinical reports, in terms of the number of words. Figure 2 and Table 1 show the number of each type of relations present between the temporal entities and phenotypes with a total of 706 relations in our test dataset of 25 clinical reports.

Figure 1.

Length of each clinical report in terms of number of words. There are 25 files in the test set.

Figure 2.

Number of relations per entity in 25 clinical reports.

Table 1.

Number of relationship types for each entity.

Ent/Rel BEGINS-AT ENDS-AT CONTAINS OVERLAP BEFORE-OVERLAP BEFORE SIMULTANEOUS
DOB → “Phenotype” 1 0 1 0 7 0 0
DOR → “Phenotype” 1 0 0 0 2 0 0
DOV → “Phenotype” 85 0 7 2 181 11 0
DOPV → “Phenotype” 50 4 1 38 59 10 0
DOFV → “Phenotype” 0 0 0 0 0 0 0
DOTHER → “Phenotype” 0 0 0 0 0 0 0
Age → “Phenotype” 9 1 1 14 9 0 0
Duration → “Phenotype” 22 0 9 10 58 0 11
Frequency → “Phenotype” 0 0 0 0 0 0 3
Time → “Phenotype” 34 2 3 16 41 3 0
Total 202 7 22 80 357 24 14

The relations are in the form of “Temporal Entity” → “Phenotype.”

Relation types

Defining temporal relations within the clinical context can be a difficult task; we build on previous works to do so, most notably the guidelines presented as part of the annotation of the MERLOT corpus.3 For our task of building a patient timeline, we define the following relations.

  • BEGINS-AT: when a Phenotype begins on the same time point as the one referred by a time expression.

  • ENDS-AT: when a Phenotype ends at the same time point as the one referred by a time expression.

  • BEFORE: when a Phenotype ends before the time point referred by the time expression.

  • BEFORE-OVERLAP: when a Phenotype occurs before a temporal entity and continues to occur during the timespan of the time expression.

  • SIMULTANEOUS: when a Phenotype and a time entity (or 2 time entities) share exactly the same time span.

  • OVERLAP: when the phenotype and temporal entity share some common time span but no further information is known.

  • CONTAINS: when a time expression contains the beginning and the end of a phenotype.
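Read as interval logic, these definitions can be summarized by a small schematic function (our own illustration, not part of the annotation tooling); intervals are hypothetical (start, end) pairs on a shared timeline:

```python
def temporal_relation(pheno, time):
    """Map a phenotype interval and a time-expression interval to a relation.

    Intervals are (start, end) tuples on a common timeline. This is a
    schematic reading of the guideline definitions, for illustration only;
    in the annotated reports the relation is judged from context, not from
    explicit interval endpoints.
    """
    ps, pe = pheno
    ts, te = time
    if (ps, pe) == (ts, te):
        return "SIMULTANEOUS"    # exactly the same time span
    if pe < ts:
        return "BEFORE"          # phenotype ends before the time expression
    if ps < ts and ts <= pe <= te:
        return "BEFORE-OVERLAP"  # starts before, persists into the timespan
    if ps == ts:
        return "BEGINS-AT"       # begins at the same time point
    if pe == te:
        return "ENDS-AT"         # ends at the same time point
    if ts < ps and pe < te:
        return "CONTAINS"        # time expression contains the phenotype
    return "OVERLAP"             # some shared span, nothing more known

print(temporal_relation((2, 5), (5, 9)))  # -> BEFORE-OVERLAP
print(temporal_relation((3, 4), (1, 8)))  # -> CONTAINS
```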

Experimental set-up

Annotation process

To establish a gold standard for our experiments, 2 annotators were asked to annotate (using the BRAT tool18) the same set of 25 clinical reports. The annotators are one linguist (J.P.) and one computer scientist (A.J.J.). The linguist is of French origin and holds a master’s degree in language science. The computer scientist is of Indian origin, holds a PhD in Natural Language Processing, has been working with clinical reports in the French language for 2 years, and has a B2 level of French. The annotations achieved an average F1 score Inter Annotator Agreement (IAA) of 0.84. The IAA for each relation type is detailed in the table in Appendix S1.
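Pairwise F1 agreement between two annotators can be computed by treating one annotation set as reference and the other as prediction; a minimal sketch (the tuple layout is an assumption for illustration, not the BRAT export format):

```python
def pairwise_f1(ann_a, ann_b):
    """F1 agreement between two annotators' relation sets.

    Each set contains hashable relation tuples, eg
    (report_id, time_entity, phenotype, relation_type).
    Annotator A is treated as reference and B as prediction; F1 is
    symmetric in this setting.
    """
    tp = len(ann_a & ann_b)  # relations both annotators marked identically
    if not tp:
        return 0.0
    precision = tp / len(ann_b)
    recall = tp / len(ann_a)
    return 2 * precision * recall / (precision + recall)

a = {(1, "2019", "fièvre", "BEGINS-AT"), (1, "2019", "toux", "OVERLAP")}
b = {(1, "2019", "fièvre", "BEGINS-AT"), (1, "2019", "toux", "BEFORE")}
print(pairwise_f1(a, b))  # -> 0.5 (1 agreement, 1 disagreement each way)
```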

Large language models

For the extraction of temporal relations, we use a few-shot prompting approach with 3 different LLMs. To maintain the privacy of the data, we used local installations of the LLMs; no data leaves our local database. We use 3 LLMs for comparison: Llama319 with 8B parameters, Gemma20 with 7B parameters trained on 8T tokens, and Mistral21 small 2409 with 22B parameters. The openai Python library was used for the experiments, along with an Ollama server to privately serve the models. All models were served in an 8-bit quantized format, which preserves performance while lowering computation requirements.
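A local Ollama server exposes an OpenAI-compatible endpoint, so the openai library can be pointed at it without any data leaving the host. A minimal sketch, where the endpoint URL, model tag, and system message are illustrative and depend on the local installation:

```python
def make_client(base_url="http://localhost:11434/v1"):
    """Client for a locally served model; nothing leaves the host.

    Ollama exposes an OpenAI-compatible API, so the standard client can be
    pointed at it; the api_key is a required but unused dummy value.
    """
    from openai import OpenAI  # deferred so the prompt helper works standalone
    return OpenAI(base_url=base_url, api_key="ollama")

def build_messages(prompt_text):
    """Wrap a few-shot prompt in the chat format expected by the API."""
    return [
        {"role": "system",
         "content": "You extract temporal relations from French clinical text."},
        {"role": "user", "content": prompt_text},
    ]

# Usage (requires a running Ollama server; model tag is illustrative):
#   client = make_client()
#   reply = client.chat.completions.create(
#       model="mistral-small:22b",
#       messages=build_messages(few_shot_prompt),
#       temperature=0,
#   )
#   print(reply.choices[0].message.content)
```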

Prompting methods

We use 2 types of few-shot prompting where the prompts use texts tagged with phenotypes and time entities.

Prompt type 1: multi-class prompting
  • Goal: Classify the relation into 1 of 7 defined temporal types (BEGINS-AT, ENDS-AT, BEFORE, BEFORE-OVERLAP, OVERLAP, SIMULTANEOUS, CONTAINS) or NONE.

  • Structure:

    • Each prompt included 2 manually selected examples for every relation type, totaling 14 examples.

    • Examples were embedded within a consistent instruction block to create clear task boundaries.

    • Target entities (phenotype, time expression) were marked using HTML/XML-style tags to reduce lexical confusion and guide LLM attention.

  • Text segmentation:

    • For each test instance, the prompt extracted:

      • Up to 1000 tokens preceding and following the phenotype.

      • Up to 1000 tokens preceding and following the temporal entity.

This ensured a sufficient contextual window for LLMs while avoiding truncation.

  • Observed challenge:

    • Despite providing diverse examples, LLM performance was hindered, likely due to excessive cognitive load caused by choosing among 7 similar options.
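The multi-class prompt assembly described above can be sketched roughly as follows; the instruction wording and helper names are our own illustration, not the exact prompt used in the experiments:

```python
RELATIONS = ["BEGINS-AT", "ENDS-AT", "BEFORE", "BEFORE-OVERLAP",
             "OVERLAP", "SIMULTANEOUS", "CONTAINS"]

def context_window(tokens, start, end, width=1000):
    """Up to `width` tokens before and after an entity's token span."""
    return tokens[max(0, start - width):end + width]

def multiclass_prompt(examples, tagged_text):
    """Assemble the multi-class few-shot prompt (schematic reconstruction).

    `examples` is a list of (tagged_text, relation_label) pairs, two per
    relation type (14 in total); `tagged_text` is the test excerpt with the
    target phenotype and time entity marked with XML-style tags.
    """
    header = (
        "Classify the temporal relation between the <phenotype> and <time> "
        "entities as one of: " + ", ".join(RELATIONS) + ", or NONE.\n"
    )
    shots = "\n".join(f"Text: {t}\nRelation: {label}" for t, label in examples)
    return f"{header}\n{shots}\n\nText: {tagged_text}\nRelation:"

examples = [
    ("<time>À 3 ans</time>, apparition d'une <phenotype>surdité</phenotype>.",
     "BEGINS-AT"),
    # ... 2 curated examples per relation type in the real prompt
]
print(multiclass_prompt(
    examples, "<phenotype>Fièvre</phenotype> depuis <time>2019</time>."))
```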

Prompt type 2: binary classification prompting
  • Goal: Evaluate whether a specific temporal relation exists between a phenotype and time entity—YES (relation type) or NO (NONE). In our binary classification setup, each relation type is predicted independently. We resolve cases where multiple relation types are predicted as true by selecting the one associated with the smallest training sample size. This heuristic is based on the assumption that rarer relation types are less likely to be predicted spuriously and may reflect more specific semantic patterns.

  • Structure:

    • Separate prompt constructed for each relation type (eg, one for BEGINS-AT, another for CONTAINS).

    • Included 3 positive examples and 3 NONE examples.

    • This reduces the task complexity from multi-class to binary classification.

  • Instructions simplified:

    • Task framed as a single decision: Does this example match the BEGINS-AT relation or not?

    • All example texts were formatted and labelled identically to maintain consistency.

  • Advantages:

    • Significantly higher LLM performance across all relation types.

    • Reflects cognitive load theory—simpler decisions improve model accuracy.

    • Empirically better F1 scores as demonstrated in Table 2 of our Results section.
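The rarest-first resolution heuristic over the independent binary predictions can be sketched as follows, using the relation counts from Table 1; the `resolve` function and the predictions dict are illustrative names:

```python
TRAIN_COUNTS = {  # relation frequencies in the annotated data (Table 1)
    "BEGINS-AT": 202, "ENDS-AT": 7, "CONTAINS": 22, "OVERLAP": 80,
    "BEFORE-OVERLAP": 357, "BEFORE": 24, "SIMULTANEOUS": 14,
}

def resolve(predictions, counts=TRAIN_COUNTS):
    """Pick one relation when several binary classifiers answered YES.

    predictions: dict mapping relation type -> bool, one entry per binary
    prompt. The rarest relation type wins, on the assumption that rare
    labels are less likely to be predicted spuriously.
    """
    positive = [rel for rel, yes in predictions.items() if yes]
    if not positive:
        return "NONE"
    return min(positive, key=lambda rel: counts[rel])

print(resolve({"BEGINS-AT": True, "ENDS-AT": True, "OVERLAP": False}))
# -> ENDS-AT (7 training instances vs 202 for BEGINS-AT)
```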

Table 2.

Results for multi-class and binary classification few-shot prompts compared to the zero-shot and OpenNRE baselines.

Model BEGINS-AT ENDS-AT CONTAINS BEFORE OVERLAP BEFORE-OVERLAP SIMULTANEOUS
Prompt 1 (Multi-class classification)
Llama 0.37 0.42 0.15 0.27 0.03 0.51 0.09
Mistral 0.49 0.54 0.11 0.31 0.11 0.50 0.15
Gemma 0.44 0.33 0.29 0.21 0.25 0.40 0.26
Prompt 2 (Binary classification)
Llama 0.56 0.55 0.18 0.48 0.03 0.67 0.10
Mistral 0.66 0.70 0.17 0.59 0.11 0.65 0.35
Gemma 0.62 0.57 0.40 0.59 0.37 0.66 0.37
Zero-shot prompting
Llama 0.23 0.20 0.10 0.16 0.21 0.17 0.09
Mistral 0.22 0.19 0.10 0.11 0.10 0.30 0.13
Gemma 0.18 0.16 0.19 0.13 0.21 0.17 0.15
OpenNRE6
OpenNRE 0.32 0.45 0.35 0.31 0.22 0.46 0.25

The bolded values show the model with the best performance in each category.

Design rationale:

  • Few-shot over fine-tuning: Chosen to preserve data privacy and enable reproducibility on local, secure servers.

  • Manual example curation: Ensured quality and representation for each relation despite limited annotated data.

Consistent formatting: All prompts followed an identical markup and structure to reduce confusion and reinforce pattern learning.

Figure 3 shows the 2 types of prompts that were used in our experiments.

Figure 3.

A figure showing few shot prompting in LLM with two subfigures where subfigure A shows multi-class prompting and subfigure B shows binary prompting.

Few-shot prompting LLM approach. (A) Prompting the model to output the relation type between the given text tagged with phenotype and time entities. (B) Prompting the model for each relation type (in figure for “BEGINS-AT” given the text tagged with phenotype and time entities).

Results and analysis

Results achieved by the 3 LLMs with the 2 types of few-shot prompts are displayed in Table 2, together with the results of binary zero-shot classification and OpenNRE. For binary zero-shot classification, we used a prompt similar to prompt 2 but without any examples, ie, the LLMs are given the definition of a single relation type and asked to classify the input text into 1 of the 2 classes (the defined relation type or NONE), with no previous examples demonstrating the relation type. Table 2 clearly shows that the LLMs can learn from a few representative examples. When prompted to identify the relation among several choices, the tested LLMs do not perform very well, leading us to believe that they tend to get confused when given many choices. With the second prompt, on the other hand, performance increases, confirming that multiple choices and complex instructions can hinder performance relative to a simpler formulation of the problem. For each relation type, a different LLM performs best. On average, it can be argued that Gemma performs better than the other 2 LLMs, as it has significantly better results for CONTAINS and OVERLAP. However, the table also shows that the OVERLAP, CONTAINS, and SIMULTANEOUS relations are complicated for an LLM to capture. This could be due to one of the following factors: (i) not enough examples were used for the few-shot approach, seen specifically for SIMULTANEOUS, whose count is low compared to the other relation types; (ii) the tested LLMs confuse the definitions of the relations, seen specifically for BEFORE-OVERLAP and OVERLAP, which have similar definitions and can only be differentiated through context; (iii) not enough context is available for the LLM to understand the relation type, causing confusion between similar relation types such as BEFORE, BEFORE-OVERLAP, and OVERLAP.
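For reference, per-relation F1 over aligned gold and predicted labels (the metric reported in Table 2) can be computed with a short routine like this sketch (function name and label encoding are our own):

```python
def f1_per_relation(gold, pred, relations):
    """Per-relation-type F1 over aligned gold/predicted label sequences.

    gold, pred: equal-length lists of relation labels (one per candidate
    phenotype-time pair, with "NONE" for no relation).
    """
    scores = {}
    for rel in relations:
        tp = sum(1 for g, p in zip(gold, pred) if g == rel and p == rel)
        fp = sum(1 for g, p in zip(gold, pred) if g != rel and p == rel)
        fn = sum(1 for g, p in zip(gold, pred) if g == rel and p != rel)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[rel] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

gold = ["BEGINS-AT", "OVERLAP", "BEGINS-AT"]
pred = ["BEGINS-AT", "BEGINS-AT", "BEGINS-AT"]
print(f1_per_relation(gold, pred, ["BEGINS-AT", "OVERLAP"]))
# -> {'BEGINS-AT': 0.8, 'OVERLAP': 0.0}
```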

Comparative evaluation

To contextualize the performance of our few-shot prompting framework, we included comparisons with 2 baseline methods: zero-shot prompting and the OpenNRE toolkit.6 Zero-shot prompting involves providing only a relation definition, with no examples included. The LLMs were asked to determine whether a specific relation (eg, BEGINS-AT) exists between a marked phenotype and time entity. This minimalist setup tests the models’ general language understanding without domain-specific guidance. We also evaluated the OpenNRE framework,6 which offers pre-implemented neural architectures for relation extraction, including BERT-based models.

Performance by approach

Binary Classification consistently outperforms both multi-class classification and zero-shot approaches across all models and relation types. The average improvement from multi-class to binary classification is 0.24 absolute F1 (from 0.31 to 0.55 average F1), demonstrating that cognitive load reduction significantly enhances LLM performance in temporal relation extraction.

Multi-class Classification shows moderate performance, with models struggling to distinguish between 7 different relation types simultaneously. This aligns with cognitive load theory, where too many simultaneous choices overwhelm the model’s processing capacity.

Zero-shot Classification demonstrates the poorest performance, confirming that few-shot learning with examples is crucial for this specialized clinical task. Interestingly, the traditional OpenNRE baseline often outperforms zero-shot LLMs, suggesting that domain-specific training remains valuable.

OpenNRE surpassed zero-shot LLMs across all relation types. For example, CONTAINS yielded an F1 score of 0.35, and BEFORE-OVERLAP reached 0.46—both higher than any zero-shot LLM performance.

Interpretation

These comparative results demonstrate that:

  • Few-shot prompting outperforms both baselines, especially when binary classification reduces cognitive load.

  • Zero-shot LLM prompting is inadequate for temporal RE without examples.

  • OpenNRE provides a useful baseline, but lacks the adaptability and privacy-preserving deployment offered by LLMs in our setup.

Model rankings by weighted performance

  • Mistral (22B): Best overall performer with average F1 of 0.55 (binary classification)

  • Gemma (7B): Competitive at 0.54 average, with superior performance on complex relations

  • Llama3 (8B): Underperforms at 0.45 average across relations

The results reveal a bimodal distribution of relation extraction difficulty:

High-performing relations: BEGINS-AT, ENDS-AT, BEFORE-OVERLAP (F1: 0.55-0.70).

Low-performing relations: OVERLAP, CONTAINS, SIMULTANEOUS (F1: 0.03-0.40).

It should also be noted that even for relations where human agreement is near perfect (table in Appendix S1), LLMs struggle significantly, with the best binary classification results (Table 2) still showing substantial gaps compared to human performance. For example, SIMULTANEOUS relations have a human F1 of 0.89 but a best LLM F1 of only 0.37.

The improvement from multi-class to binary prompting demonstrates cognitive load theory in action—when LLMs must choose between 7 relations simultaneously, performance degrades dramatically, similar to human working memory limitations.

This gap suggests fundamental differences in how LLMs process temporal semantics, with clear temporal boundaries being more tractable than fuzzy or complex temporal relationships.

Research context and theoretical implications

Temporal reasoning in clinical contexts

Cognitive Load Theory Application: The dramatic performance difference between binary (Prompt 2) and multi-class (Prompt 1) classification aligns with cognitive load theory. When presented with 7 simultaneous classification options, LLMs exhibit similar limitations to human cognitive processing—the working memory becomes overwhelmed, leading to degraded performance.

Linguistic Complexity in Medical French: The study’s focus on French clinical texts, written freely by medical doctors to document rare disease patients’ medical histories, introduces additional layers of complexity, such as:

  • Medical terminology variability: French medical language has multiple ways to express temporal relationships

  • Syntactic structures: French temporal expressions often involve complex prepositional phrases that may confuse LLMs

  • Cultural medical reporting styles: French clinical reports may have different narrative structures compared to English texts used in LLM training

Few-Shot Learning Limitations: With only 2-3 examples per relation type in few-shot prompts, the study reveals fundamental limitations in LLM sample efficiency for specialized domains. Thus, complex relations like OVERLAP may require more diverse examples. Furthermore, clinical temporal relations are highly context-dependent, requiring more examples to capture variability. Some relations (SIMULTANEOUS: F1 = 0.09-0.37) may be inherently rare, making few-shot learning ineffective.

Conclusion

In this paper, we have presented an analysis of using LLMs for the task of relation extraction with 2 formats of few-shot prompts, to understand the capabilities of each tested LLM, showing that a simpler task formulation corresponding to binary classification yields better results. We focused on LLMs because our motivation was to deploy and share models that can extract information without giving away data (which an LLM can do, given that no patient information is embedded in the shared prompts). With this goal in mind, the 25 reports described here presented a good number of relation types between entities, enabling us to learn more about using LLMs for the relation extraction task. Comparison of the proposed method to BERT-based baselines will be made possible in future works by extending the dataset to allow fine-tuning of such models.

Supplementary Material

ooaf121_Supplementary_Data

Contributor Information

Judith Jeyafreeda Andrew, Clinical Bioinformatics Laboratory, INSERM UMR1163, Imagine Institute, Université Paris Cité, Paris, F-75006, France; PRAIRIE, PaRis Artificial Intelligence Research InstitutE, PARIS, 75012, France.

Juliette Potier, Clinical Bioinformatics Laboratory, INSERM UMR1163, Imagine Institute, Université Paris Cité, Paris, F-75006, France.

Nicolas Garcelon, Clinical Bioinformatics Laboratory, INSERM UMR1163, Imagine Institute, Université Paris Cité, Paris, F-75006, France.

Anita Burgun, Clinical Bioinformatics Laboratory, INSERM UMR1163, Imagine Institute, Université Paris Cité, Paris, F-75006, France; PRAIRIE, PaRis Artificial Intelligence Research InstitutE, PARIS, 75012, France; Necker Enfants Malades Hospital, AP-HP, Paris, 75015, France.

Marc Vincent, Clinical Bioinformatics Laboratory, INSERM UMR1163, Imagine Institute, Université Paris Cité, Paris, F-75006, France.

Author contributions

Judith Jeyafreeda Andrew (Conceptualization, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review & editing), Juliette Potier (Data curation), Nicolas Garcelon (Project administration), Anita Burgun (Conceptualization, Project administration, Supervision, Writing—review & editing), and Marc Vincent (Resources, Supervision, Validation, Writing—review & editing)

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by state funding by The French National Research Agency (ANR) under the C’IL-LICO project (ANR-17-RHUS-0002) and as part of the “Investissements d’avenir” program (ANR-19-P3IA-0001) (PRAIRIE 3IA Institute).

Conflicts of interest

The authors have no competing interests to declare.

Data availability

Due to the highly specific nature of the text data used in this study, and the practical impossibility of guaranteeing complete anonymization at the individual level, French regulations do not allow us to publicly share these data outside the framework of the research agreement, the designated research partners, and the scope of patient consent. Aggregate data, however, are available upon request to the authors. Prior to sharing, a data use agreement must be negotiated between the 2 parties in order to comply with applicable guidelines and regulations. In accordance with these regulations,22 the dataset will be archived and stored in a secure enclave. Outside the scope of the current research, access to patient data may be granted only upon submission of a research project to a duly authorized scientific and ethics committee. The process can be initiated by contacting dpd@institutimagine.org.

Ethics

This work was conducted in accordance with all applicable GDPR guidance and protocols.

References

  • 1. World Health Organization. The International Classification of Diseases. 1990. Accessed October 12, 2025. http://www.who.int/classifications/icd/en/
  • 2. Kovačević A, Bašaragin B, Milošević N, Nenadić G.  De-identification of clinical free text using natural language processing: a systematic review of current approaches. Artif Intell Med. 2024;151:102845. 10.1016/j.artmed.2024.102845
  • 3. Campillos L, Deléger L, Grouin C, Hamon T, Ligozat A-L, Névéol A.  A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Lang Resour Eval. 2018;52:571-601.
  • 4. Pustejovsky J, Castaño JM, Ingria R, et al.  TimeML: robust specification of event and temporal expressions in text. In: New Directions in Question Answering. AAAI Technical Report SS-03-07; 2003:28-34.
  • 5. Andrew JJ, Vincent M, Burgun A, Garcelon N.  Evaluating LLMs for temporal entity extraction from pediatric clinical text in rare diseases context. In: Demner-Fushman D, Ananiadou S, Thompson P, Ondov B, eds. Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024. ELRA and ICCL; 2024:145-152.
  • 6. Han X, Gao T, Yao Y, Ye D, Liu Z, Sun M. OpenNRE: an open and extensible toolkit for neural relation extraction. In: Padó S, Huang R, eds. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. Association for Computational Linguistics; 2019:169-174.
  • 7. Long Y, Li Z, Wang X, Li C. XJNLP at SemEval-2017 Task 12: clinical temporal information extraction with a hybrid model. In: Bethard S, Carpuat M, Apidianaki M, et al., eds. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics; 2017:1014-1018.
  • 8. Lin C, Miller T, Dligach D, Sadeque F, Bethard S, Savova G. A BERT-based one-pass multi-task model for clinical temporal relation extraction. In: Demner-Fushman D, Cohen KB, Ananiadou S, Tsujii J, eds. Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online. Association for Computational Linguistics; 2020:70-75.
  • 9. Raghavan P, Lai AM, Fosler-Lussier E. Learning to temporally order medical events in clinical text. In: Li H, Lin C-Y, Osborne M, et al., eds. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics (ACL); 2012:70-74.
  • 10. Lin C, Miller T, Dligach D, Amiri H, Bethard S, Savova G. Self-training improves recurrent neural networks performance for temporal relation extraction. In: Lavelli A, Minard A-L, Rinaldi F, eds. Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis. Association for Computational Linguistics; 2018:165-176.
  • 11. Lin C, Miller T, Dligach D, Bethard S, Savova G. A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In: Rumshisky A, Roberts K, Bethard S, Naumann T, eds. Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2019:65-71.
  • 12. Bramsen P, Deshpande P, Lee YK, Barzilay R. Finding temporal order in discharge summaries. In: AMIA Annual Symposium Proceedings. Vol 2006. 2006:81.
  • 13. Wan Z, Cheng F, Mao Z. GPT-RE: in-context learning for relation extraction using large language models. In: Bouamor H, Pino J, Bali K, eds. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023:3534-3547.
  • 14. Wadhwa S, Amir S, Wallace B. Revisiting relation extraction in the era of large language models. In: Rogers A, Boyd-Graber J, Okazaki N, eds. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2023:15566-15589.
  • 15. Thakur A, Zhu T, Clifton D, Shah NH, Youssef A, Pencina M. External validation of AI models in health should be replaced with recurring local validation. Nat Med. 2023;29:2686-2687. 10.1038/s41591-023-02540-z
  • 16. Garcelon N, Neuraz A, Salomon R, et al.  A clinician friendly data warehouse oriented toward narrative reports: Dr. Warehouse. J Biomed Inform. 2018;80:52-63.
  • 17. Vincent M, Douillet M, Lerner I, Neuraz A, Burgun A, Garcelon N.  Using deep learning to improve phenotyping from clinical reports. Stud Health Technol Inform. 2022;290:282-286.
  • 18. Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. BRAT: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2012:102-107.
  • 19. Dubey A, Jauhri A, Pandey A, et al.  2024. The Llama 3 herd of models. arXiv, preprint arXiv:2407.21783, preprint: not peer reviewed.
  • 20. Team G, Mesnard T, Hardin C, et al.  2024. Gemma: open models based on Gemini research and technology. arXiv, preprint arXiv:2403.08295, preprint: not peer reviewed.
  • 21. Jiang AQ, Sablayrolles A, Roux A, et al.  2024. Mixtral of experts. arXiv, preprint arXiv:2401.04088, preprint: not peer reviewed.
  • 22. CNIL. Référentiel relatif aux traitements de données à caractère personnel mis en œuvre dans le cadre des recherches dans le domaine de la santé [Reference framework for the processing of personal data in health research]. Accessed October 12, 2025. https://www.cnil.fr/sites/cnil/files/atoms/files/referentiel_-_recherches_dans_le_domaine_de_la_sante.pdf

Articles from JAMIA Open are provided here courtesy of Oxford University Press