2025 Oct 24;67(2):741–752. doi: 10.1111/epi.18683

Comparing three natural language processing methods for the automatic identification of epilepsy patients from French clinical notes

François Le Gac 1, Quentin Calonge 1,2,3, Candice Estellat 4, Vincent Navarro 1,2,3
PMCID: PMC12927673  PMID: 41133988

Abstract

Objective

Manual review of clinical notes by experts remains the reference standard for identifying patients with epilepsy in health databases. However, this process is labor‐intensive and time‐consuming due to the unstructured nature of text. Prior studies have shown the potential of natural language processing for automated phenotyping. We aim to develop and validate algorithms capable of identifying patients with epilepsy based on a set of clinical notes.

Methods

A population of 109 448 patients was selected from the Assistance Publique‐Hôpitaux de Paris (AP‐HP) Clinical Data Warehouse (CDW) (38 hospitals in Paris, France) based on the presence of an International Classification of Diseases, Tenth Revision (ICD‐10) diagnostic code related to epilepsy (G40/G41) or mimicking disorders (R53/R55/R56), or the mention of at least one antiseizure medication in their medical chart. From this pre‐screened population, 6733 sentences (from 2700 patients) were labeled as indicative or not indicative of epilepsy, and 3000 patients were selected randomly for manual review by a neurologist. We compared a “basic” keyword‐based method, a rule‐based method, and a pretrained language model for identifying epilepsy‐related sentences and classifying patients with epilepsy. We reported the F1 score of each method.

Results

At the sentence level, the pretrained language model reached the highest F1 score of .95 (95% CI: .95–.96), outperforming the rule‐based method (.87, 95% CI: .86–.88) and the basic method (.81, 95% CI: .80–.81). At the patient level, the pretrained language model also achieved the best F1 score (.95, 95% CI: .94–.96) compared to the rule‐based method (.93, 95% CI: .91–.94) and the basic method (.82, 95% CI: .81–.84).

Significance

Both the rule‐based and the pretrained language models achieved high performance. These algorithms can automatically identify patients with epilepsy from unstructured clinical notes in French data warehouses, supporting large‐scale phenotyping and the detection of epilepsy as a comorbidity.

Keywords: automated phenotyping, clinical data warehouse, electronic health record, epilepsy, natural language processing (NLP), pretrained language model


Key points.

  • We developed and validated three natural language processing (NLP) methods to identify patients with epilepsy from clinical notes of Assistance Publique‐Hôpitaux de Paris (AP‐HP) Clinical Data Warehouse (CDW) (38 hospitals in Paris, France).

  • The pretrained language model achieved the highest F1 score (0.95) for both the sentence‐level and the patient‐level evaluation tasks.

  • Our methods can enable large‐scale epilepsy phenotyping in French CDWs including patients seen in inpatient and outpatient settings.

1. INTRODUCTION

Epilepsy is a chronic neurological disorder characterized by the recurrence of spontaneous epileptic seizures. 1 Epilepsy is a public health issue, with an estimated 45.9 million people living with epilepsy (PwE) in 2016 2 and an incidence rate estimated at 61.4 per 100 000 person‐years. 3

Electronic health records (EHRs) offer an opportunity to improve the understanding of epilepsy epidemiology and enhance patient management by leveraging large volumes of real‐world data. 4 However, International Classification of Diseases, Tenth Revision (ICD‐10) codes (G40, G41) to identify PwE may lack sensitivity, particularly when patients are seen only in outpatient consultations, 5 since diagnosis codes are assigned only during inpatient stays in France. Manually reviewing all medical reports to confirm or rule out the diagnosis is not feasible due to the large number of patients. Natural language processing (NLP), a branch of artificial intelligence, could significantly streamline the use of medical reports by automating the extraction of information from free‐text data. Such approaches have been applied in epilepsy for various purposes, including identifying patients with psychogenic non‐epileptic seizures (PNES), calculating surgical scores, and detecting psychiatric comorbidities, using both rule‐based and machine learning methods. 6

A recent study 7 identified PwE within a cohort with excellent performance (area under the receiver‐operating characteristic curve [AUROC] and area under the precision‐recall curve [AUPRC] = 1). However, this study relied mostly on patients seen in epilepsy units, raising concerns about the generalizability of these algorithms to broader health data repositories, where patients receive care across multiple departments (emergency, intensive care, surgery). 8

In this study, we developed and compared three NLP methods: a rule‐based method, a deep‐learning method, and a basic keyword method, to automatically identify PwE from a set of French clinical notes from 38 different hospitals in Paris, including pre‐screened patients managed in both inpatient and outpatient settings. The three methods were first evaluated on their ability to distinguish between sentences that indicate epilepsy and those that do not, and subsequently assessed on their performance in identifying patients with epilepsy from their entire set of clinical notes.

2. METHODS

2.1. Study design

We conducted a retrospective study using the Assistance Publique‐Hôpitaux de Paris (AP‐HP) Clinical Data Warehouse (CDW), which includes data from all patients hospitalized or followed as outpatients across 38 hospitals of the Greater Paris Area. This represents 1.5 million hospitalizations per year (10% of all hospitalizations in France 9 ) and around 5 million outpatient visits per year. This database gathers a large amount of patient health information, including medical reports (hospitalization, consultation, imaging, etc.) of patients managed at AP‐HP, as well as all diagnostic discharge codes classified according to the ICD‐10 for hospitalized patients. French regulation does not require a patient's written consent for this type of research, but the patients were informed of this research, and those who objected to the secondary use of their data were excluded from the study. Data were pseudonymized by replacing names and places of residence with aliases.

2.2. Participants

We first selected all potential PwE from the AP‐HP database and patients with disorders that could resemble epilepsy. We extracted patients with a principal diagnosis or a secondary diagnosis linked to epilepsy according to the ICD‐10 classification (G40, G41). We also extracted patients with mention of at least one antiseizure medication (ASM) (the full list is provided in Table S1) in at least one clinical note. ASMs can be prescribed for various conditions beyond epilepsy, including chronic pain or psychiatric disorders. In addition, we extracted patients with a principal diagnosis or secondary diagnosis for a condition mimicking epilepsy, namely “convulsions not elsewhere classified” (R56), “syncope and collapse except shock” (R55), and “malaise and fatigue” (R53). The extraction period spanned from July 1, 2017 to December 31, 2019. We deliberately excluded data from 2020 onward to avoid potential biases related to changes in clinical documentation and health care utilization during the coronavirus disease 2019 (COVID‐19) pandemic. Patients were excluded if they lacked at least one informative type of medical note (as detailed in Table S2) or if all their clinical notes were empty. Some medical reports may occasionally be empty due to defective PDF digitization.

2.3. Labeling of sentences and patients

2.3.1. Preprocessing

We preprocessed the clinical notes for both sentence‐ and patient‐level evaluations. We first segmented the clinical notes into sets of sentences. A frequent difficulty in the medical corpus was that a line break did not necessarily correspond to a real sentence boundary, especially in scanned PDFs. We used a pretrained algorithm 10 to detect false end‐of‐lines and recover the full context of each sentence. Raw sentences were also transformed into a cleaner, more structured form by lowercasing, removing accents, converting quotation marks to standard form, and removing noise (e.g., encoding artifacts and HTML tags).
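The normalization steps above can be sketched as follows (a minimal illustration; the function name and exact rules are ours, not the study's implementation):

```python
import re
import unicodedata

def normalize_sentence(raw: str) -> str:
    """Normalize a raw sentence: lowercase, strip accents,
    standardize quotation marks, and remove encoding/HTML noise."""
    text = raw.lower()
    # Strip accents: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Convert curly quotation marks and apostrophes to standard form.
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    # Remove HTML tags and collapse whitespace left by artifacts.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_sentence("Épilepsie <b>généralisée</b>"))  # epilepsie generalisee
```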

2.3.2. Labeling of sentences

We extracted 24 551 sentences from the clinical notes of 2700 patients selected randomly from our pre‐screened population. Sampling was stratified across the 38 participating hospitals, with an equal number of patients selected from each institution. This approach was intended to ensure that all hospitals, including those not specialized in epilepsy and therefore at higher risk of underrepresentation under purely random sampling, were adequately represented. It also reduced the risk of our algorithms overfitting to the writing style of any single institution. Every sentence extracted from these patients after preprocessing contained at least one keyword from a dictionary of words specific to the pathology, such as “epilepsy,” “epileptic,” or “focal seizure” (the full list is provided in Table S3). The dataset included sentences such as “Absence of epileptic abnormalities,” “Epileptic patient followed for 3 years,” or “Mother has epilepsy.” A neurologist specialized in epilepsy (Q.C.) annotated the sentences as “indicating epilepsy” or “not indicating epilepsy” following precise sentence annotation guidelines (Supporting Information Method 1).

2.3.3. Labeling of patients

To assess the methods at the patient level, we randomly selected 3000 patients from our pre‐screened population, excluding the 2700 patients used for sentence‐level evaluation. The sampling was also stratified by the 38 different hospitals. A neurologist (Q.C.) then manually reviewed each patient and annotated the patient as epileptic or not. Patients with a mention of a remote history of epilepsy (>10–20 years) and without any seizure for all the following years and without treatment were annotated as non‐epileptic. Patients who exclusively had febrile seizures (i.e., convulsions in a child that are caused by a fever) or with only PNES (i.e., paroxysmal episodes resembling epileptic seizures but arising from psychological factors rather than abnormal neuronal activity) were annotated as non‐epileptic.

In case of doubt about the labeling of a patient, a second expert (V.N.) provided an additional review and the final decision was reached by consensus between the two reviewers.

2.4. NLP analyses at sentence level

We first aimed to develop three automated methods capable of distinguishing sentences that indicate epilepsy from those that do not. Of the 24 551 labeled sentences, 12 473 were randomly assigned to the training set and 5345 to the validation set of the pretrained language model. The remaining 6733 sentences were used to compare the performance of all three methods (test set). We ensured that the target class, namely sentences indicating epilepsy, was equally distributed across the three datasets.

2.4.1. The rule‐based method

The rule‐based approach relies on predefined rules. The rule‐based algorithm detects negation (e.g., “Absence of epileptic abnormality”), hypothesis (e.g., “Suspicion of seizure related to epilepsy”), family link (e.g., “paternal grandfather: epilepsy”), reported speech (e.g., “The patient reports having had an epileptic seizure”), email addresses (e.g., “doctor's contact: smith@epilepsycenter.com”), and URLs (e.g., “https://parisepilepsycenter.com”) in a sentence. These qualifiers were implemented using the Entrepôt de données de santé ‐ Natural Language Processing (EDS‐NLP) library. 11 They are all based on hand‐crafted rules, meaning that they rely on the detection of specific words or punctuation marks in the sentence (e.g., “Absence of” or “no” for the negation qualifier). If one of the qualifiers is triggered, the sentence is discarded and labeled as “not indicating epilepsy.” We also defined a dictionary of excluding words that discard the sentence if detected near the keyword of interest. For example, if “reference center” or “day hospital” appeared just before “epilepsy,” the sentence was discarded and predicted as “not indicating epilepsy” (the full list of excluding keywords is provided in Supporting Information Method 2). Otherwise, the sentence is predicted as “indicating epilepsy.”
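As a rough illustration of this discard logic (the study relies on the EDS‐NLP library's full rule sets; the regex triggers below are simplified stand‐ins of our own, not the actual qualifiers):

```python
import re

# Illustrative triggers only; EDS-NLP ships much richer rule sets.
QUALIFIER_PATTERNS = {
    "negation": r"\b(absence of|no|without)\b",
    "hypothesis": r"\b(suspicion of|suspected|possible)\b",
    "family": r"\b(mother|father|grandfather|grandmother|brother|sister)\b",
    "reported_speech": r"\b(reports|reported|according to the patient)\b",
    "email_or_url": r"(\S+@\S+|https?://\S+)",
}
# Words that discard the sentence when found near the keyword.
EXCLUDING_NEIGHBORS = r"\b(reference center|day hospital)\b"

def rule_based_label(sentence: str) -> str:
    """Return 'indicating epilepsy' unless a qualifier or an
    excluding neighbor fires, mirroring the discard logic above."""
    s = sentence.lower()
    for pattern in QUALIFIER_PATTERNS.values():
        if re.search(pattern, s):
            return "not indicating epilepsy"
    if re.search(EXCLUDING_NEIGHBORS, s):
        return "not indicating epilepsy"
    return "indicating epilepsy"

print(rule_based_label("Absence of epileptic abnormality"))   # not indicating epilepsy
print(rule_based_label("Epileptic patient followed for 3 years"))  # indicating epilepsy
```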

2.4.2. The pretrained language model

The pretrained language model takes a sentence as input and returns a probability. If the probability exceeds a threshold, the sentence is predicted as “indicating epilepsy”; otherwise, it is predicted as “not indicating epilepsy.” Rather than relying on rules for negation or hypothesis, the pretrained language model infers these notions from training examples. We started from CamemBERT, 12 a model built on the Transformer architecture and pretrained in a self‐supervised manner on a large corpus of French texts. Through this pretraining, the model first learns grammar, syntax, and the general meanings of words. The model is then further trained on sentences with corresponding labels, a process called fine‐tuning. In our case, we provided 17 818 sentences, each labeled “indicating epilepsy” or “not indicating epilepsy.” We fine‐tuned CamemBERT until performance on the validation set of 5345 sentences was satisfactory.
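The decision step can be sketched as follows, assuming a standard two‐logit classification head on top of the encoder (the logit values below are hypothetical, for illustration only):

```python
import math

def softmax_prob(logit_pos: float, logit_neg: float) -> float:
    """Probability of the 'indicating epilepsy' class, computed
    from the classifier head's two output logits."""
    e_pos, e_neg = math.exp(logit_pos), math.exp(logit_neg)
    return e_pos / (e_pos + e_neg)

def classify_sentence(prob: float, threshold: float = 0.5) -> str:
    """Map the model's output probability to a binary label."""
    return "indicating epilepsy" if prob > threshold else "not indicating epilepsy"

p = softmax_prob(2.0, -1.0)  # hypothetical logits; p is about .95
print(classify_sentence(p))   # indicating epilepsy
```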

2.4.3. The basic method

To evaluate the relative utility of the rule‐based and the pretrained language models, we added a basic method: each time a keyword from the epilepsy dictionary was detected in a sentence, the sentence was predicted as indicating epilepsy, without applying any qualifiers (negation, hypothesis, etc.). An overview of the three methods is illustrated in Figure 1.
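A minimal sketch of this baseline, using an illustrative subset of the Table S3 dictionary (the keyword set below is not the full dictionary):

```python
# Illustrative subset of the epilepsy dictionary (full list: Table S3).
EPILEPSY_KEYWORDS = {"epilepsy", "epileptic", "focal seizure"}

def basic_label(sentence: str) -> str:
    """Predict 'indicating epilepsy' whenever any dictionary keyword
    appears, with no negation/hypothesis/family handling."""
    s = sentence.lower()
    if any(kw in s for kw in EPILEPSY_KEYWORDS):
        return "indicating epilepsy"
    return "not indicating epilepsy"

# Illustrates the method's weakness: negated mentions become false positives.
print(basic_label("Absence of epileptic abnormalities"))  # indicating epilepsy
```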

FIGURE 1.

FIGURE 1

Comparison of the three methods for sentence binary classification. In this example: (left) the basic approach predicts the sentence as “indicating epilepsy” because of the presence of the keyword “epilepsy” in the sentence; (middle) the rule‐based method predicts the sentence as “not indicating epilepsy” due to the trigger of the negation qualifier; (right) the probability output of the pretrained language model is p = .1, therefore the model predicts the sentence as “not indicating epilepsy.” Underlined outputs are the results of each classification method. *The epilepsy dictionary is available in Table S3. **The list of excluding keywords is available in Supporting Information Method 2.

2.5. The patient‐level evaluation workflow

We started by extracting, for all 3000 patients in the test set, every sentence containing at least one keyword from the epilepsy dictionary. We preprocessed the sentences and then applied the basic, rule‐based, and pretrained language model methods. Finally, we applied the following aggregation strategy: if one or more sentences were classified as “indicating epilepsy” in one or more clinical notes of a patient, the patient was classified as epileptic. The workflow of patient‐level evaluation is illustrated in Figure 2.
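The aggregation strategy can be sketched as follows (the `min_indicative` parameter is our generalization of the default one‐sentence rule, allowing stricter thresholds):

```python
def classify_patient(sentence_labels, min_indicative=1):
    """Aggregate sentence-level predictions across all of a patient's
    notes: the patient is epileptic if at least `min_indicative`
    sentences were labeled 'indicating epilepsy'."""
    n = sum(1 for lab in sentence_labels if lab == "indicating epilepsy")
    return "epileptic" if n >= min_indicative else "non-epileptic"

labels = ["not indicating epilepsy", "indicating epilepsy", "indicating epilepsy"]
print(classify_patient(labels))                    # epileptic (one sentence suffices)
print(classify_patient(labels, min_indicative=3))  # non-epileptic (stricter rule)
```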

FIGURE 2.

FIGURE 2

Workflow for identifying epileptic patients from clinical notes.

We evaluated alternative aggregation strategies by varying the number of indicative sentences required to classify a patient as epileptic, from 1 (Strategy 1, S1) to 7 (Strategy 7, S7), and computed F1 scores for all three methods. We also assessed combinations of ICD‐10 codes and ASM use, relying on ASM mentions in clinical notes given the absence of structured outpatient ASM data (see Table S4 for the ASM list).

2.6. Metrics used

We used the recall, the precision, and the F1 score to assess the performance of our algorithms. Precision is also known as positive predictive value, and recall is also referred to as sensitivity in diagnostic binary classification. The F1 score is the harmonic mean of the precision and recall and thus symmetrically represents both precision and recall in one metric. Precision/recall are better suited for binary classification evaluation than sensitivity/specificity in the case of imbalanced datasets. 13 The output probabilities from the pretrained language model were converted into binary decisions using the threshold that maximized the average F1 score across both sentence‐level and patient‐level evaluations. We used McNemar's statistical test to compare the prediction errors of the three main methods. McNemar's test compares the performance of a pair of models on the same dataset. We performed 1000 bootstrapping iterations to calculate 95% confidence intervals (CIs).
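The metrics and the percentile bootstrap CI can be sketched as follows (our own illustration; the study's exact implementation is not shown in the paper):

```python
import random

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = epilepsy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def bootstrap_f1_ci(y_true, y_pred, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1: resample (truth, prediction)
    pairs with replacement and take the empirical quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(precision_recall_f1([y_true[i] for i in idx],
                                          [y_pred[i] for i in idx])[2])
    scores.sort()
    lo = scores[int(n_iter * alpha / 2)]
    hi = scores[int(n_iter * (1 - alpha / 2)) - 1]
    return lo, hi
```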

2.7. Focus on patients with diagnostic codes G40 (epilepsy) and G41 (status epilepticus)

We evaluated the three methods on all patients who received a diagnostic code G40 or G41 between July 1, 2017, and December 31, 2019. The goal was to ensure that our algorithms primarily classified patients with diagnostic codes G40 and G41 as epileptic. The discrepancies between the diagnostic codes and the algorithms' predictions were reviewed manually by an expert (Q.C.).

2.8. Application of the three methods on the pre‐screened population

We applied the rule‐based approach, the pretrained language model, and the basic approach on the pre‐screened population. This analysis aimed to assess differences between the methods in terms of numbers of detected patients with epilepsy on a large scale.

3. RESULTS

3.1. Participants

A total of 113 116 patients from the AP‐HP database met our inclusion criteria, namely a principal diagnosis or secondary diagnosis code from the following list: G40, G41, R53, R55, or R56, or a mention of one ASM in at least one clinical note from July 1, 2017 to December 31, 2019. Of these, 3668 patients were excluded due to the absence of any readable or informative clinical notes, resulting in a final population of 109 448 patients. The flow diagram is illustrated in Figure 3.

FIGURE 3.

FIGURE 3

Flow diagram. n, number of patients. ICD‐10 codes: G40 = epilepsy; G41 = status epilepticus; R56 = convulsions not elsewhere classified; R55 = syncope and collapse except shock; R53 = malaise and fatigue.

3.2. Sentence‐level evaluation

The training set contained 8357 sentences (67%) labeled by the expert as “indicating epilepsy,” the validation set 3582 (67%), and the test set 4512 (67%). Performance results of our methods are shown in Table 1. The pretrained language model achieved the highest F1 score with .95 (95% CI: .95–.96) compared to the rule‐based and the basic methods with, respectively, .87 (95% CI: .86–.88) and .81 (95% CI: .80–.81). Of the 6733 sentences of the sentence test set, the pretrained language model made 207 false negatives (FNs) and 211 false positives (FPs). The rule‐based method produced 810 FNs and 295 FPs. The confusion matrices are available in Figure S1.

TABLE 1.

Comparison of performance between the basic method, rule‐based model, and the pretrained language model at the sentence level against a test set of 6733 sentences (67% indicating epilepsy).

Precision Recall F1
Basic method .67 (.66–.69) 1 .81 (.80–.81)
Rule‐based model .93 (.92–.93) .82 (.81–.83) .87 (.86–.88)
Pretrained language model .95 (.95–.96) .95 (.95–.96) .95 (.95–.96)

Note: Precision = positive predictive value; recall = sensitivity. The highest values for each metric are shown in bold.

3.3. Patient‐level evaluation

The characteristics of the study cohort and the pre‐screened population are summarized in Table 2. Among the 3000 patients manually labeled, 25% were annotated as epileptic by the expert. The majority of patients in the cohort (63%) had no mention of epilepsy, whereas 17% had more than 10 mentions related to epilepsy, and the remaining 20% had between 1 and 9 mentions.

TABLE 2.

Demographic and clinical characteristics in the study cohort vs pre‐screened population participants.

Characteristic | Pre‐screened population | Study cohort
Number of patients (n) | 109 448 | 3000
Number of notes (N) | 3 944 430 | 129 215
Percentage of epileptic patients | N/A (pretrained language and rule‐based estimates: 37%) | 26%
Percentage of epileptic patients with a diagnostic code related to epilepsy (G40/G41) | N/A (pretrained language and rule‐based estimates: 26%) | 21%
Percentage of epileptic patients without any keyword from the epilepsy dictionary | N/A | .1%
Percentage of patients without any keyword from the epilepsy dictionary | 53% | 63%
Sex (M/F) | 49%/51% | 48%/52%
Age 0–18 years | 11 789 (11%) | 341 (11%)
Age 19–45 years | 24 340 (22%) | 504 (17%)
Age 46–65 years | 26 493 (24%) | 655 (22%)
Age 66–106 years | 46 826 (43%) | 1500 (50%)
Patients with a G40 principal diagnosis | 10 039 (9%) | 149 (5%)
Patients with a G41 principal diagnosis | 3733 (3%) | 63 (2%)
Patients with a G40 or G41 principal diagnosis | 11 111 (11%) | 185 (6%)
Five most frequent principal diagnoses a | Z098 13%; R53+0 10%; Z511 9%; Z515 6%; Z512 5% | Z098 11%; Z511 9%; R53+0 8%; Z512 5%; Z515 5%
Median number of notes per patient | 22 (IQR: 9–47) | 30 (IQR: 14–58)
Median number of informative notes per patient | 9 (IQR: 4–19) | 12 (IQR: 5–24)
Five most frequent note types | imaging reports 17%; prescriptions 15%; consultation reports 12%; emergency reports 9%; other documents 9% | prescriptions 18%; imaging reports 17%; consultation reports 12%; emergency reports 8%; inpatient reports 7%

a Principal diagnosis code labels can be found in Table S5.

Performance comparisons between the three methods and ICD‐10 + ASM combinations are summarized in Table 3. As in the sentence‐level evaluation, the pretrained language model achieved the highest F1 score, with .95 (95% CI: .94–.96) compared to the rule‐based model and the basic method, with, respectively, .93 (95% CI: .91–.94) and .82 (95% CI: .81–.84).

TABLE 3.

Comparison of performance between the basic method, rule‐based model, and the pretrained language model, and diagnostic code ICD‐10 + ASM mention combinations at the patient level, against a test set of 3000 patients (25% of epileptic patients).

Precision Recall F1
Basic method .70 (.68–.73) .99 (.99–1) .82 (.81–.84)
Rule‐based model .88 (.86–.91) .97 (.96–.98) .93 (.91–.94)
Pretrained language model .92 (.90–.94) .98 (.97–.99) .95 (.94–.96)
ICD‐10: G40, G41 .88 (.85–.91) .63 (.59–.67) .74 (.71–.76)
ICD‐10: G40, G41, R56 .86 (.84–.89) .67 (.64–.70) .76 (.73–.78)
ICD‐10: G40, G41 and ≥1 ASM mention(s) .94 (.92–.96) .54 (.50–.57) .68 (.65–.71)
ICD‐10: G40, G41 and ≥2 ASM mention(s) .96 (.94–.97) .49 (.46–.52) .65 (.62–.68)
ICD‐10: G40, G41 and ≥3 ASM mention(s) .98 (.96–.99) .42 (.38–.45) .58 (.55–.62)
ICD‐10: G40, G41, R56 or ≥1 ASM mention(s) .78 (.76–.81) .96 (.94–.97) .86 (.84–.88)
ICD‐10: G40, G41, R56 or ≥2 ASM mention(s) .82 (.79–.84) .91 (.89–.93) .86 (.84–.88)
ICD‐10: G40, G41, R56 or ≥3 ASM mention(s) .83 (.81–.86) .85 (.83–.88) .84 (.82–.86)

Note: Precision = positive predictive value; recall = sensitivity. G40 = epilepsy; G41 = status epilepticus; R56 = convulsions not elsewhere classified. The highest values for each metric are shown in bold.

Abbreviation: ASM, antiseizure medication.

Pairwise McNemar's tests were performed to compare the pretrained language model, the rule‐based method, and the basic approach. All comparisons showed statistically significant differences after Bonferroni correction (α = .05/3 ≈ .0167). The confusion matrices are provided in Figure S2 and more details about performance by age group are in Table S6. A threshold of .5 was selected for the pretrained language model, as the average F1 score between the sentence‐ and patient‐level evaluations plateaued at .95 within the .3–.6 range. More details about threshold selection can be found in Figure S3. All three methods consistently outperformed ICD‐10 + ASM–based strategies with respect to F1 score.

Increasing the required number of sentences “indicating epilepsy” from 1 to 7 sentences before classifying a patient as epileptic resulted in a performance decline for both the pretrained language model and the rule‐based approach. The F1 score of the pretrained language model decreased from .95 (95% CI: .94–.96) to .77 (95% CI: .75–.80). The F1 score of the rule‐based approach also decreased from .93 (95% CI: .91–.94) to .75 (95% CI: .72–.78). F1 scores across the different strategies are illustrated in Figure S4.

3.4. Focus on G40 (epilepsy) and G41 (status epilepticus) patients

We found 11 111 patients with a G40 or G41 diagnostic code in the pre‐screened population (10.1% of the 109 448 participants), with 1 059 384 associated notes. All were predicted as epileptic by our algorithms except 346 patients for the rule‐based method and 383 patients for the pretrained language model, with an intersection of 246 patients. Upon verification by the expert (Q.C.), 64.3% of the 346 patients predicted as non‐epileptic by the rule‐based approach were confirmed as non‐epileptic, and 81.7% of the 383 patients predicted as non‐epileptic by the pretrained language model were confirmed as non‐epileptic.

Computing the false‐positive rate on the full G40/G41 cohort (11 111 patients) would require extensive manual labeling, which was not feasible. However, within the labeled subset of 3000 patients, 185 had a principal diagnosis of G40/G41, among whom only 5 were false positives.

3.5. Pre‐screened population results

We applied the three methods on the pre‐screened population of 109 448 patients with 3 944 430 associated notes. The rule‐based method and the pretrained language model identified 40 195 PwE (36.7% of the population) and 40 946 (37.4%), respectively. Both methods agreed on 38 436 cases. The basic method identified 51 390 patients with epilepsy (46.9%).

Only 26.0% and 26.2% of the patients predicted as epileptic by the rule‐based method and the pretrained language model, respectively, had a diagnostic code related to epilepsy.

4. DISCUSSION

In this study, we developed a rule‐based method, a pretrained language model, and a basic method. The first two methods demonstrated high performance in distinguishing epilepsy‐indicating sentences and identifying PwE from EHRs. The pretrained language model achieved the best performance, with an F1 score of .95 (95% CI: .95–.96) on a test set of 6733 sentences and .95 (95% CI: .94–.96) on a separate test set of 3000 patients from 38 hospitals. Although comparing our fine‐tuned CamemBERT with generative language models such as GPT‐4 or LLaMA‐3, as in a recent study, 14 would have been informative, CamemBERT was chosen for its much smaller size and its compatibility with the AP‐HP platform and its security constraints. The combination of ICD‐10 codes and ASM mentions yielded inferior results compared to the rule‐based method and the pretrained language model, with a maximum F1 score of .86 (95% CI: .84–.88) for the combination “ICD‐10: G40, G41, R56 or ≥1 ASM mention.”

Among the 11 111 patients with an epilepsy‐related diagnostic code (G40/G41), the rule‐based method and the pretrained language model identified 96.9% and 96.5% as epileptic, respectively. In addition, 81.7% of the 383 discrepancies between the diagnostic codes and the pretrained language model predictions were confirmed as non‐epileptic by the expert, aligning with the algorithm's prediction rather than the diagnostic code. These cases corresponded to patients initially suspected of epilepsy but ruled out in a more recent clinical note. These results align with prior studies showing that NLP algorithms can be used to identify ICD‐coding errors. 15 , 16

Identification of PwE across the pre‐screened population of 109 448 patients revealed a discrepancy of 11 195 patients between the rule‐based and the basic method and 10 444 patients between the pretrained language model and the basic method. This finding emphasizes the utility of the rule‐based and pretrained language models over the basic method. Furthermore, the low proportion of patients with a G40 or G41 code among those identified as epileptic (26.0% for the pretrained language model and 26.1% for the rule‐based approach) highlights the usefulness of our methods for developing studies that encompass patients managed in both inpatient and outpatient settings.

Previous studies have identified PwE on EHR data by combining ICD codes, electroencephalography, and ASM consumption with no text mining on clinical notes. 5 , 17 , 18 These studies were conducted on health care administrative databases, which typically lack clinical information. A systematic review 19 of 30 studies concluded that it is reasonable to use these combinations to identify PwE but advised use of clinical notes when available to improve and validate the detections.

In a recent study, Fernandes and colleagues 7 have shown the potential of text‐mining algorithms for the automatic identification of PwE using clinical notes. With a machine learning model trained on features extracted from notes (bag of words), diagnostic codes, and ASMs, they achieved an F1 score of .99 on a test set of 1729 patients sampled randomly from 9 hospitals. In a systematic review, Yew and colleagues 6 highlighted 15 other studies for patient identification using NLP in the context of epilepsy. Two of these studies distinguished patients with epilepsy from those with PNES, 20 , 21 and one study 22 showed that it was feasible to detect subtypes of epilepsy, such as generalized or focal epilepsy, with an F1 score of .85. The systematic review points out the lack of generalizability of most NLP algorithms that were trained and validated on textual data from a single dataset or institution. Variations in clinical practice settings, EHR templates, and terminology used across different institutions may impede the generalizability of the models. In contrast, the inclusion of data from 38 different hospitals enhances the generalizability of our study. The systematic review also warns against the lack of interpretability of NLP models, as users are often blinded to the underlying information used to generate the output. In contrast, both the rule‐based and the pretrained language models can display the mentions in a patient's clinical notes that triggered the final classification.

To our knowledge, no studies have been conducted on the automatic detection of PwE using French clinical notes. In a previous study using a rule‐based method and a pretrained language model for the automatic detection of diabetes, solid tumors, and leukemia in clinical notes, 23 these methods achieved F1 scores of 96.6, 95.6, and 96.3, respectively. Bey and colleagues 24 used a similar workflow to identify populations whose suicidality was most affected during the COVID‐19 pandemic.

The study also presents some limitations. First, both the rule‐based method and the pretrained language model are restricted to French clinical notes. Although limited to a single language, the methods developed in this study remain highly relevant given the growing use of EHR‐based data warehouses in France and other French‐speaking countries. Second, our methods were applied to patients from a pre‐screened population, which may not reflect performance in the general population. However, performance is expected to be higher in practice, as the pre‐screened sample includes PwE alongside patients with related disorders, making discrimination more challenging than with unrelated conditions. Third, our methods do not distinguish between patients with chronic epilepsy and those who have experienced a single seizure. Further research is needed to differentiate these two scenarios. Fourth, we assumed that any epilepsy mention indicated a diagnosis, although the reliability of mentions varies across note types (e.g., a neurologist's note may be more trustworthy than an emergency department note). Even though our approach yielded good performance, external validation in another clinical data warehouse would be valuable to ensure its reproducibility. Fifth, the validation of machine learning applications is typically based on independent review by three experts, which allows for measurement of inter‐ and intra‐rater reliability. In our study, patients were reviewed by a single expert, with verification by a second expert for complex cases. Another limitation concerns patients whose initial epilepsy diagnosis was later revoked: a patient was initially diagnosed with epilepsy, but the diagnosis was subsequently removed in a more recent clinical note. As both the rule‐based approach and the pretrained language model are triggered by the first mention, they will classify such a patient as epileptic, even when the diagnosis is ultimately incorrect.
Seventh, analyses were restricted to 2017–2019, excluding the COVID‐19 years (2020–2021); future studies should verify algorithm performance on more recent data. Finally, we did not attempt to identify patients with drug‐resistant epilepsy, for whom detection algorithms have been proposed in claims databases. 25 , 26
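The patient‐level decision rule underlying the limitations discussed above (a patient is classified as epileptic once a minimum number of epilepsy‐indicative sentences is found across their notes, as in strategies S1–S3 of Figure S4) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline; the function names and the keyword classifier are hypothetical.

```python
# Illustrative sketch (not the authors' code) of the patient-level
# decision rule: a patient is labeled epileptic when the number of
# epilepsy-indicative sentences across all their clinical notes
# reaches a threshold (strategy S1 uses threshold=1, S2 uses 2, ...).

def classify_patient(note_sentences, is_indicative, threshold=1):
    """Return True if at least `threshold` sentences across the
    patient's notes are flagged as indicative of epilepsy.

    note_sentences: iterable of sentences from all clinical notes
    is_indicative:  callable mapping a sentence to True/False
                    (e.g., a keyword, rule-based, or model classifier)
    """
    mentions = sum(1 for sentence in note_sentences if is_indicative(sentence))
    return mentions >= threshold


# Toy example with a hypothetical keyword-based classifier:
keywords = ("epilepsie", "crise tonico-clonique")
flag = lambda s: any(k in s.lower() for k in keywords)

notes = [
    "Patient suivi pour epilepsie depuis 2015.",
    "Pas de fievre ce jour.",
]
print(classify_patient(notes, flag, threshold=1))  # True
```

A rule triggered at the first mention (threshold of 1) explains the revoked‐diagnosis limitation: a later note withdrawing the diagnosis cannot lower the mention count.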

5. CONCLUSION

In this study, both the rule‐based method and the pretrained language model achieved strong results. The pretrained language model performed best, with an F1 score of .95 (95% CI: .95–.96) at the sentence level and .95 (95% CI: .94–.96) at the patient level. The strong performance of both methods suggests that applying similar approaches to other non‐English corpora could also yield good results. The key steps of our two approaches are summarized in Supporting Information Method 3, which may serve as a practical guide for teams interested in developing their own NLP‐based algorithms.

These two methods enable rapid, large‐scale, and reliable identification of patients with epilepsy in French medical corpora, facilitating the construction of comprehensive epidemiological cohorts. Such cohorts can include not only patients with epilepsy‐related diagnosis codes but also those identified through medical consultations without explicit coding. We plan to extend the automatic diagnostic framework to encompass other forms of epilepsy, such as drug‐resistant epilepsy.

AUTHOR CONTRIBUTIONS

François Le Gac and Quentin Calonge jointly contributed to the development of the study and the writing of the manuscript. Dr. Calonge was primarily responsible for the clinical aspects of the study, whereas François Le Gac led the development and documentation of the natural language processing methods. Dr. Estellat critically reviewed the manuscript and contributed her expertise in epidemiology to the interpretation of the findings. Pr. Navarro supervised the study, reviewed the manuscript, and provided strategic direction throughout the project.

FUNDING INFORMATION

This work was supported by a public grant overseen by the Agence Nationale de la Recherche (ANR) as part of the “Investissements d'Avenir” ANR‐10‐IAIHU‐06, and by a public grant of the Données de santé et Applications (DAtAE; project Epi2).

CONFLICT OF INTEREST STATEMENT

None of the authors has any conflict of interest to disclose. We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

ETHICS APPROVAL STATEMENT

This study was approved by the institutional review board of Assistance Publique–Hôpitaux de Paris (AP‐HP). The study used de‐identified patient data obtained from the AP‐HP Clinical Data Warehouse (Entrepôt de Données de Santé, EDS). Access to the data was granted following approval by the local ethics committee and compliance with the procedures described at www.eds.aphp.fr.

Supporting information

Figure S1. Confusion matrices of the three methods on a test set of 6733 sentences.

EPI-67-741-s001.tif (1.1MB, tif)

Figure S2. Confusion matrices of the three methods on a test set of 3000 patients.

EPI-67-741-s003.tif (1.3MB, tif)

Figure S3. Pretrained language model F1 scores as functions of the decision threshold. The average F1 score is the mean between the sentence evaluation F1 score and the patient evaluation F1 score.

EPI-67-741-s004.tif (787.5KB, tif)

Figure S4. Patient‐level performance of the three methods across seven different strategies: “S1: if at least one mention was found in one or more clinical notes of the patient, the patient is epileptic”; “S2: if at least two mentions were found in one or more clinical notes of the patient, the patient is epileptic”; “S3: if at least three mentions were found in one or more clinical notes of the patient, the patient is epileptic”; …

EPI-67-741-s005.tif (1MB, tif)

Data S1.

EPI-67-741-s002.docx (56.3KB, docx)

ACKNOWLEDGMENTS

None.

DATA AVAILABILITY STATEMENT

Access to the Clinical Data Warehouse's de‐identified raw data can be granted following the process described on its website: www.eds.aphp.fr. Prior validation of access by the local institutional review board is required. Researchers not affiliated with Assistance Publique–Hôpitaux de Paris (AP‐HP) must additionally sign a collaboration contract.

REFERENCES

  • 1. Fisher RS, Acevedo C, Arzimanoglou A, Bogacz A, Cross JH, Elger CE, et al. ILAE Official Report: a practical clinical definition of epilepsy. Epilepsia. 2014 Apr;55(4):475–482.
  • 2. Beghi E, Giussani G, Nichols E, Abd‐Allah F, Abdela J, Abdelalim A, et al. Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet Neurol. 2019 Apr;18(4):357–375.
  • 3. Fiest KM, Sauro KM, Wiebe S, Patten SB, Kwon CS, Dykeman J, et al. Prevalence and incidence of epilepsy: a systematic review and meta‐analysis of international studies. Neurology. 2017 Jan 17;88(3):296–303.
  • 4. Mbwana JS, Grinspan ZM, Bailey R, Berl M, Buchhalter J, Bumbut A, et al. Using EHRs to advance epilepsy care. Neurol Clin Pract. 2019 Feb;9(1):83–88.
  • 5. Franchi C, Giussani G, Messina P, Montesano M, Romi S, Nobili A, et al. Validation of healthcare administrative data for the diagnosis of epilepsy. J Epidemiol Community Health. 2013;67(12):1019–1024.
  • 6. Yew ANJ, Schraagen M, Otte WM, Van Diessen E. Transforming epilepsy research: a systematic review on natural language processing applications. Epilepsia. 2023 Feb;64(2):292–305.
  • 7. Fernandes M, Cardall A, Jing J, Ge W, Moura LMVR, Jacobs C, et al. Identification of patients with epilepsy using automated electronic health records phenotyping. Epilepsia. 2023 Jun;64(6):1472–1481.
  • 8. Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med. 2021 Jul 15;385(3):283–286.
  • 9. Abbara S, Guillemot D, El Oualydy S, Kos M, Poret C, Breant S, et al. Antimicrobial resistance and mortality in hospitalized patients with bacteremia in the greater Paris area from 2016 to 2019. Clin Epidemiol. 2022 Dec;14:1547–1560.
  • 10. Zweigenbaum P, Grouin C, Lavergne T. Une catégorisation de fins de lignes non‐supervisée. Actes de la conférence conjointe JEP‐TALN‐RECITAL 2016, Paris, France. Volume 2. Paris, France: AFCP‐ATALA; 2016. p. 364–371. https://aclanthology.org/2016.jeptalnrecital‐poster.7
  • 11. Wajsburt P, Petit‐Jean T, Dura B, Cohen A, Jean C, Bey R. EDS‐NLP: efficient information extraction from French clinical notes [Internet]. Paris, France: Zenodo; 2022. https://aphp.github.io/edsnlp/latest
  • 12. Martin L, Muller B, Ortiz Suárez PJ, Dupont Y, Romary L, De La Clergerie É, et al. CamemBERT: a tasty French language model. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020. p. 7203–7219. https://www.aclweb.org/anthology/2020.acl‐main.645
  • 13. Saito T, Rehmsmeier M. The precision‐recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015 Mar 4;10(3):e0118432.
  • 14. Abeysinghe R, Tao S, Lhatoo SD, Zhang GQ, Cui L. Leveraging pretrained language models for seizure frequency extraction from epilepsy evaluation reports. npj Digit Med. 2025;8(1):208. 10.1038/s41746-025-01592-4
  • 15. Falter M, Godderis D, Scherrenberg M, Kizilkilic SE, Xu L, Mertens M, et al. Using natural language processing for automated classification of disease and to identify misclassified ICD codes in cardiac disease. Eur Heart J Digital Health. 2024 May 1;5(3):229–234.
  • 16. Johnson SA, Signor EA, Lappe KL, Shi J, Jenkins SL, Wikstrom SW, et al. A comparison of natural language processing to ICD‐10 codes for identification and characterization of pulmonary embolism. Thromb Res. 2021 Jul 1;203:190–195.
  • 17. Bellini I, Policardo L, Zaccara G, Palumbo P, Rosati E, Torre E, et al. Identification of prevalent patients with epilepsy using administrative data: the Tuscany experience. Neurol Sci. 2017 Apr;38(4):571–577.
  • 18. Smith JR, Jones FJS, Fureman BE, Buchhalter JR, Herman ST, Ayub N, et al. Accuracy of ICD‐10‐CM claims‐based definitions for epilepsy and seizure type. Epilepsy Res. 2020 Oct;166:106414.
  • 19. Mbizvo GK, Bennett KH, Schnier C, Simpson CR, Duncan SE, Chin RFM. The accuracy of using administrative healthcare data to identify epilepsy cases: a systematic review of validation studies. Epilepsia. 2020;61(7):1319–1335.
  • 20. Hamid H, Fodeh SJ, Lizama AG, Czlapinski R, Pugh MJ, LaFrance WC, et al. Validating a natural language processing tool to exclude psychogenic nonepileptic seizures in electronic medical record‐based epilepsy research. Epilepsy Behav. 2013;29(3):578–580.
  • 21. Pevy N, Christensen H, Walker T, Reuber M. Feasibility of using an automated analysis of formulation effort in patients' spoken seizure descriptions in the differential diagnosis of epileptic and nonepileptic seizures. Seizure. 2021 Oct;91:141–145.
  • 22. Connolly B, Matykiewicz P, Bretonnel Cohen K, Standridge SM, Glauser TA, Dlugos DJ, et al. Assessing the similarity of surface linguistic features related to epilepsy across pediatric hospitals. J Am Med Inform Assoc. 2014;21(5):866–870.
  • 23. Petit‐Jean T, Gérardin C, Berthelot E, Chatellier G, Frank M, Tannier X, et al. Collaborative and privacy‐enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions. J Am Med Inform Assoc. 2024 Apr 4;31(6):1280–1290.
  • 24. Bey R, Cohen A, Trebossen V, Dura B, Geoffroy PA, Jean C, et al. Natural language processing of multi‐hospital electronic health records for public health surveillance of suicidality. npj Mental Health Res. 2024 Feb 14;3(1):1–9.
  • 25. Hill CE, Lin CC, Terman SW, Rath S, Parent JM, Skolarus LE, et al. Definitions of drug‐resistant epilepsy for administrative claims data research. Neurology. 2021;97(13):e1343–e1350. 10.1212/WNL.0000000000012514
  • 26. Bølling‐Ladegaard E, Dreier JW, Christensen J. An algorithm for drug‐resistant epilepsy in Danish national registers. Brain. 2025;148(3):753–763. 10.1093/brain/awae286



Articles from Epilepsia are provided here courtesy of Wiley
