Journal of the American Medical Informatics Association: JAMIA. 2025 Sep 12;32(11):1663–1673. doi: 10.1093/jamia/ocaf155

Comparison of rule- and large language model-based phenotype extraction from clinical notes for neurofibromatosis type 1

Levi Kaster 1,#, Ethan Hillis 2,#, Inez Y Oh 3, Elizabeth C Cordell 4, Randi E Foraker 5, Albert M Lai 6, Stephanie M Morris 7, David H Gutmann 8, Philip R O Payne 9, Aditi Gupta 10,
PMCID: PMC12626218  PMID: 40966762

Abstract

Introduction

Neurofibromatosis type 1 (NF1) is a rare genetic disorder affecting multiple organ systems with significant clinical heterogeneity. Managing individuals with NF1 is challenging due to variability in disease progression and outcomes, as well as limited tools for early risk assessment.

Objective

This study aims to develop an effective, generalizable, user-friendly clinical entity extraction pipeline for identifying NF1-related phenotypes from unstructured clinical notes to enhance research and risk-modeling efforts. We compare the benefits of rule-based natural language processing (NLP) vs large language models (LLMs) for this purpose.

Materials and Methods

Four phenotype extraction pipelines (3 LLM-based vs 1 rule-based) were developed to automatically extract selected NF1-relevant phenotypes. Subject matter experts manually reviewed clinical notes, generating a gold-standard annotation dataset for evaluation. In Phase 1, notes authored by a single NF1 physician were used to guide pipeline development and refinement. In Phase 2, notes from a second NF1 physician were used to assess pipeline generalizability, followed by further refinement to accommodate differences in physician terminology.

Results

With refinement, the rule-based model achieved higher distributions of F1 scores than the LLMs in both Phase 1 and Phase 2. However, the LLMs demonstrated better generalizability between physicians, showing smaller performance decreases (4.4%-5.1%) when transitioning from Phase 1 to Phase 2 without refinement, compared to an 8.8% decrease for the rule-based model.

Conclusion

We highlight trade-offs between the effectiveness of rule-based NLP vs generalizability and ease of implementation of LLMs for clinical entity extraction, with implications for pipeline portability across providers and institutions.

Keywords: large language models, neurofibromatosis type 1, information extraction, natural language processing, phenotyping

Background and significance

Neurofibromatosis type 1 (NF1) is a rare autosomal dominant syndrome caused by germline mutations in the NF1 gene, affecting 1 in 2800 individuals worldwide.1 NF1 is characterized by extreme clinical heterogeneity, affecting multiple organ systems with varying levels of severity. Common manifestations include neurocognitive and behavioral abnormalities, such as attention deficit hyperactivity disorder (ADHD) and autism, as well as peripheral nervous system tumors (eg, neurofibromas) and central nervous system tumors (eg, optic pathway gliomas [OPGs]).2–4 As a result, individuals with NF1 exhibit a reduced life expectancy, chiefly due to an elevated risk of malignancy.3,4

Currently, the management of NF1 is largely reactive with a paucity of clinical decision support tools to aid with risk stratification and prognostic assessment.2,3 The development of machine learning (ML) models that predict individual disease course could lead to tailored disease surveillance and more effective preventative measures. With respect to risk assessment, recent studies have demonstrated that sex5–7 and germline genetics (specific NF1 gene mutation)8–11 are risk factors for some NF1-associated phenotypes. Additionally, our group has previously demonstrated that ML using structured clinical registry and electronic health record (EHR) data can predict development of OPG and ADHD in children with NF1.12

While structured EHR data contains features predictive of future disease course, the data most relevant to NF1 and other rare disease phenotypes or comorbid conditions are often stored in domain-specific clinical registries and within unstructured EHR data. Since registries can be expensive to maintain, may contain duplicate, incomplete, or inaccurate information, and may be biased towards containing patients from more advantaged socioeconomic backgrounds,13,14 routinely-collected unstructured EHR data may be more representative for studying rare disorders.

As documented by previous studies, informative data can be extracted from unstructured clinical text.15–17 Rule-based natural language processing (NLP) pipelines were the first methods employed for phenotype extraction from notes, achieving high degrees of success when standardized terminology is used in documentation.18–20 These methods specify exact phrases expected in the text, leading to highly effective extraction, despite potentially time-consuming rule development. As a result, rule-based methods continue to be used for extraction despite more complex alternatives.18–21

Pre-trained large language models (LLMs) have emerged as an alternative to traditional NLP for entity extraction from unstructured text.21,22 These models complete a variety of general-purpose tasks, including information extraction, without the need for fine-tuning.23,24 Among the highest performing of these LLMs is GPT-4, a closed-source model released by OpenAI in 2023.25 In the short time since its release, many researchers have investigated its potential application in clinical contexts, such as information extraction, clinical summarization, and question answering.22,26–30 Open-source models also demonstrate high performance across various tasks, successfully extracting information on local machines while being potentially more scalable due to lower implementation and execution costs. For example, Gemma 3 provides a 27 billion parameter model that operates on a single GPU and rivals models 10 times its size. Similarly, DeepSeek has released small, distilled versions of their DeepSeek-R1 model that achieve high performance with chain-of-thought responses, potentially enhancing extraction performance through improved reasoning.31,32 Given the range of potential methods and models for phenotype extraction, each offering unique advantages and limitations,21,33–36 we built a rule-based NLP pipeline and compared its performance to 3 LLM-extraction pipelines utilizing GPT-4, Gemma3-27 billion (Gemma3-27B), and DeepSeek-R1 distilled from the 14 billion parameter Qwen2 model (DeepSeek-14B).

Objective

The aim of this study was to develop a clinical entity extraction pipeline for identifying NF1-related phenotypes in clinical progress notes, enabling the replicable extraction of phenotypes from information that is regularly captured in the EHR. Given these aims, our project had 3 main research questions: First, can a generalizable phenotype extraction pipeline for NF1 be developed? Second, how do the phenotype extraction capabilities of rule-based NLP pipelines compare to those of LLM-based pipelines that leverage the models’ massive pre-training corpora to confer generalizability? Third, what are the benefits and use cases for each method with specific focus on generalizability and ease of use?

Materials and methods

Study cohort

This study was approved by the Washington University Institutional Review Board (IRB #201706112) and granted a waiver of HIPAA authorization for the use of protected health information (PHI). This was a retrospective study of patient encounters between June 2018 and April 2023 using EHRs from Washington University School of Medicine/Barnes-Jewish Hospital. The cohort included children (<18 years old) with an NF1 diagnosis (ICD-10 code Q85.01); individuals with mosaic (segmental) NF1 were excluded. The dataset included progress notes, corresponding note metadata (eg, author name, encounter datetime, and note type), and demographics.

Identify phenotypes for extraction

We selected 32 NF1-related phenotypes for extraction, chosen using the following criteria: (1) extent of documentation in clinical progress notes, as perceived by physician subject matter experts (SMEs) from 3 institutions (Washington University in St Louis, Kennedy Krieger Institute, and Lurie Children’s Hospital); (2) importance to pediatric NF1 prognosis, as determined by the physician SMEs; and (3) high prevalence and impact on quality of life. The final phenotypes included family history data, mental health disorders, heart diseases, tumors, and additional relevant data (Table S1).

Model development

Four total phenotype extraction pipelines were built to facilitate the automatic extraction of NF1-related phenotypes, including one rule-based pipeline and 3 LLM-based pipelines. NF1-specific clinical progress notes were extracted from the EHR to assess these methods. Preliminary versions of the pipelines were developed through collaboration with physicians and were tested on ∼10 randomly selected NF1 progress notes to ensure pipeline functionality. Two sets of notes were chosen for further pipeline development and analysis, with each containing notes from different physician authors from the same institution. Notes from separate physicians were selected to allow us to measure the intra-institutional generalizability of the pipelines. SME annotations for these note sets were developed (as outlined in Creation of Gold-Standard Annotations) and the pipelines were refined and evaluated against the annotations in 2 phases.

Phase 1 consisted of refining and evaluating the pipelines on 100 clinical notes from a single physician. To select these notes, the preliminary rule-based NLP model was used to estimate phenotype prevalence, and the 100 highest-prevalence notes from the first physician, restricted to one note per individual, were selected. The pipelines were then tested on 17 validation clinical notes from the same physician to check consistency of performance; these notes were not used for pipeline refinement. The validation notes were randomly selected, restricted to one note per individual, and shared no patients with the 100 development notes. In Phase 2, the refined pipelines from Phase 1 were evaluated on 30 unseen clinical notes from a separate physician, randomly selected from all pediatric progress notes authored by that physician and also restricted to one per individual. All pipelines were then updated to account for the new physician’s terminology, and a post-modification evaluation was performed. Figure 1 illustrates the workflow used for analysis and pipeline improvement.

Figure 1.


Figure illustrating the general project workflow. To select the 100 notes for phase 1 the preliminary rule-based NLP model was used to estimate phenotype prevalence, and the highest prevalence notes (N = 100) from the first physician, restricted to one note per individual, were selected. The 30 notes used in phase 2 were randomly selected without stratification from all pediatric progress notes authored by the second physician, also restricted to one per individual. All evaluations were performed by comparing pipeline predictions against SME annotations, as described in Pipeline Evaluation.

Rule-based pipeline

The rule-based phenotype extraction pipeline was built using physician input to guide the creation and refinement of phrase-matching NLP rules (Figure 1). We compared the results of the initial pipeline to the SME-annotated notes and used these annotations to iteratively refine the pipeline’s rules. The pipeline was implemented utilizing the target and context matching components of MedSpaCy, a free Python toolkit built on SpaCy to assist with clinical NLP tasks.37 Customized rules were provided to the pipeline through JSON files, which were easily modified through editing a corresponding CSV.
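To make the setup concrete, the sketch below (illustrative rules, not the study’s actual rule files) wires two example OPG target rules into MedSpaCy’s default pipeline, which bundles PyRuSH sentence splitting, target matching, and ConText:

```python
# A minimal sketch of a MedSpaCy pipeline with custom target rules;
# the rules here are examples, not the study's JSON/CSV rule files.
import medspacy
from medspacy.ner import TargetRule

# medspacy.load() includes PyRuSH sentence splitting, the target
# matcher, and the ConText component by default.
nlp = medspacy.load()
target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_matcher.add([
    # Multiple rules can map to a single phenotype category.
    TargetRule("optic pathway glioma", "OPG"),
    TargetRule("optic glioma", "OPG"),
])

doc = nlp("No evidence of optic pathway glioma on imaging.")
for ent in doc.ents:
    # ConText flags entities negated by cues such as "no evidence of".
    print(ent.text, ent.label_, "negated:", ent._.is_negated)
```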

Preprocessing

Custom preprocessing on the extracted note strings was performed to improve sentence splitting. Steps taken included utilizing regular expressions to remove physician descriptions of expected findings for a general NF1 patient and removing extraneous spaces. These steps were performed in Python (Version 3.8.5), and the packages used included Pandas38 (Version 2.0.3), NumPy39 (1.24.3), and Regex.
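As a minimal illustration of this step, the snippet below applies the two kinds of cleanup described; the boilerplate pattern is a hypothetical stand-in for the physician’s templated text, not the expression actually used:

```python
# Sketch of the preprocessing step, assuming a hypothetical boilerplate
# pattern; the study's expressions were tailored to the actual notes.
import re

def preprocess(note):
    # Remove templated descriptions of expected findings for a general
    # NF1 patient (the pattern below is an assumed example).
    note = re.sub(r"In a typical patient with NF1[^.]*\.", "", note)
    # Collapse extraneous runs of spaces/tabs that break sentence splitting.
    note = re.sub(r"[ \t]{2,}", " ", note)
    return note.strip()

print(preprocess("Exam notes.   In a typical patient with NF1, CALMs are expected. BP stable."))
```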

Rule-based NLP pipeline

The rule-based NLP pipeline consists of 4 steps: (1) sentence splitting, (2) identifying target phenotypes, (3) locating contextual information, and (4) determining phenotypic presence. For each clinical note, all phenotypes were marked as positive, negative, or unknown. A general description of each pipeline step is provided below:

1. Sentence splitting

PyRuSH, the Python implementation of RuSH, was used to split the clinical notes into component sentences.40 This process identifies natural boundaries within the text, ensuring that contextual and phenotypic information are only linked if they exist within the same sentence.
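A minimal PyRuSH usage sketch, following the package’s documented interface (the rules path is the sample configuration shipped with PyRuSH):

```python
# Sentence splitting with PyRuSH; spans carry character offsets into
# the original note text.
from PyRuSH import RuSH

text = "Skin exam shows scattered CALMs. No dermal neurofibromas noted."
rush = RuSH("conf/rush_rules.tsv")  # example rules file from the package
for span in rush.segToSentenceSpans(text):
    print(text[span.begin:span.end])
```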

2. Identify phenotype references

Next, phenotype references were identified according to rules specified in the “target rule” JSON and CSV files. These rules are intended to encompass all combinations of phenotype-identifying phrases in the text, and multiple rules can be associated with a single phenotype. Table 1 demonstrates a subset of rules for detecting OPG and T2-hyperintensities.

Table 1.

Table displaying example rules for identifying OPG and T2-hyperintensity references within the text.

Literal | Category | Pattern 1 | Pattern 2 | Pattern 3 | Pattern 4
OPG | OPG | optic | pathway, nerve, ? | glioma, gliomas, tumor |
Optic nerves are normal | Negative OPG | optic | pathway, nerves | are, is, were, appear, ? | normal, typical
T2-Hyperintense | T2 Hyperintensities | t2 | −, /, ? | signal, ? | hyperintensities, hyperintensity, hyperintense

The rules follow this structure in the modifiable CSV file.

In Table 1, the literal column gives an example of what the phrase looks like, and the pattern columns provide combinations of strings that can occur in the identifying phrase. The Pattern 1 column corresponds to the possible first word(s) in the phrase, the Pattern 2 column corresponds to the possible second word(s), and so forth. The “?” value signifies that the word may be omitted as a part of the identifying phrase. According to the rules above, “optic pathway glioma” and “optic gliomas” would both be identified as positive OPG references. The phrases “optic nerves are normal” and “optic nerves appear typical” would be identified as negative OPG references. Negative references can also be identified by pairing a negative contextual phrase with a positive phenotype reference.
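One way to read this format is as a cross-product of the pattern columns, with “?” marking an omissible slot. The sketch below expands the OPG rule from Table 1 under that reading; it is our illustration of the rule semantics, not the study’s code:

```python
# Expand pattern columns into concrete matching phrases, treating "?"
# as a slot that may be skipped (our reading of the CSV rule format).
from itertools import product

def expand(columns):
    phrases = set()
    for combo in product(*columns):
        words = [w for w in combo if w != "?"]
        phrases.add(" ".join(words))
    return sorted(phrases)

opg_rule = [["optic"], ["pathway", "nerve", "?"], ["glioma", "gliomas", "tumor"]]
print(expand(opg_rule))
# Includes "optic pathway glioma", "optic nerve tumor", and "optic gliomas".
```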

3. Identify contextual information

Each phenotype reference may also have associated contextual information. This contextual information is similarly detected through pattern-matching rules specified through “context rule” JSON and CSV files. Each rule is classified as a negating phrase (absence of, can rule out, etc), hypothetical phrase (risks of, not ruled out, etc), or a phrase indicating phenotype severity and/or family history (low, maternal grandfather, etc), which helps determine if the overall phenotype is negative, unknown, or adds additional granularity.
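For illustration, the example phrases above could be registered with MedSpaCy’s ConText component roughly as follows (a sketch; the study supplied its context rules via JSON/CSV files):

```python
# Registering context rules analogous to the "context rule" CSVs; the
# literals are the examples from the text, not the study's full rule set.
import medspacy
from medspacy.context import ConTextRule

nlp = medspacy.load()
context = nlp.get_pipe("medspacy_context")
context.add([
    ConTextRule("absence of", "NEGATED_EXISTENCE", direction="FORWARD"),
    ConTextRule("risks of", "HYPOTHETICAL", direction="FORWARD"),
    ConTextRule("maternal grandfather", "FAMILY", direction="BIDIRECTIONAL"),
])
```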

4. Logic to determine phenotype presence/absence

The final pipeline step is to utilize the extracted phenotypic and contextual information to determine note-level phenotypic presence. According to the associated contextual information, each phenotype reference in the text is marked as positive, negative, or unknown. Figure 2 displays 3 example sentences, illustrating how extracted phenotype and contextual information leads to determinations about phenotype presence. Note-level phenotype presence is determined by majority voting over the identified positive and negative phenotype references. If the counts of both positive and negative references are 0, the phenotype is unknown. If the positive and negative counts are equal and non-zero, the note-level phenotype is positive.

Figure 2.


Example of final rule-based NLP pipeline output across 3 sentences and phenotypes.
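The note-level decision logic described above reduces to a few lines; this sketch encodes the stated rules (zero mentions yields unknown, non-zero ties resolve to positive):

```python
# Note-level label from reference-level labels, per the voting rules
# described in the text.
def note_level_label(reference_labels):
    pos = reference_labels.count("positive")
    neg = reference_labels.count("negative")
    if pos == 0 and neg == 0:
        return "unknown"
    return "positive" if pos >= neg else "negative"

assert note_level_label([]) == "unknown"
assert note_level_label(["positive", "negative"]) == "positive"   # tie -> positive
assert note_level_label(["negative", "negative", "positive"]) == "negative"
```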

Knowledge-based extraction

In addition to the standard pipeline, a knowledge-based approach is integrated to improve the identification of macrocephaly and hypertension. For hypertension, regular expression matching is used to extract blood pressure values. The extracted values and corresponding demographic information are then compared against the American Academy of Pediatrics pediatric hypertension thresholds to determine phenotypic presence or absence.41
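A rough sketch of the hypertension check is below. The regex and the threshold arguments are illustrative assumptions; in the study, thresholds came from the AAP guideline tables for the child’s age, sex, and height.

```python
# Extract blood pressure readings by regex and compare against
# guideline thresholds. The pattern and threshold handling here are
# illustrative stubs, not the study's implementation.
import re

BP_PATTERN = re.compile(r"\b(?:BP|blood pressure)[:\s]*(\d{2,3})\s*/\s*(\d{2,3})", re.I)

def is_hypertensive(note, systolic_limit, diastolic_limit):
    # systolic_limit/diastolic_limit would come from the AAP tables
    # for the child's age, sex, and height percentile.
    for sys_bp, dia_bp in BP_PATTERN.findall(note):
        if int(sys_bp) >= systolic_limit or int(dia_bp) >= diastolic_limit:
            return True
    return False

print(is_hypertensive("Vitals today: BP 128/84, HR 92.", 120, 80))  # True
```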

LLM-based pipelines

Three LLM-based pipelines were built to extract the phenotypes, utilizing GPT-4, Gemma3-27B, and DeepSeek-14B. The GPT-4 pipeline was built within Microsoft’s Azure OpenAI Service, and a GPT-4 (version 0613) endpoint was deployed through a HIPAA-compliant subscription within Washington University’s Azure Tenant. The Gemma3-27B and DeepSeek-14B pipelines were run on a single local and HIPAA-secure NVIDIA-A100 40GB GPU, deployed utilizing a local Ollama server.42 The GPT model had its temperature argument set to 0 and the local LLMs had their temperature set to 0.1 to facilitate deterministic data extraction. To standardize model output, all prompts included a key-value pair dictionary template that provided each entity as a key and a list of possible labels for each entity (yes, no, or no information). Additionally, previous reports have shown that prompt tuning significantly alters model outputs when performing complicated tasks.43 Thus, for all prompts, the system role was asserted as a physician with the goal of identifying and extracting patient conditions from their clinical notes, shown in Figure 3.

Figure 3.


System role for all LLM prompts.
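For the local models, an extraction call mirroring this setup might look like the following sketch, using the Ollama Python client; the model tag, template entities, and prompt wording are illustrative assumptions:

```python
# Local LLM extraction call via Ollama: low temperature and a key-value
# template in the prompt, as described above. Model tag and prompt text
# are assumed examples.
import json
import ollama

clinical_note = "Spine is straight. Patient reports ongoing anxiety."  # toy note
template = {"scoliosis": "yes | no | no information",
            "anxiety": "yes | no | no information"}

response = ollama.chat(
    model="gemma3:27b",                        # assumed local model tag
    messages=[
        {"role": "system", "content": "You are a physician identifying and "
         "extracting conditions from clinical notes of patients with NF1."},
        {"role": "user", "content": "Fill this template and return JSON only: "
         + json.dumps(template) + "\n\nNote:\n" + clinical_note},
    ],
    options={"temperature": 0.1},              # near-deterministic extraction
    format="json",                             # constrain the reply to valid JSON
)
labels = json.loads(response["message"]["content"])
print(labels)  # e.g. {"scoliosis": "no", "anxiety": "yes"}
```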

Prompt engineering

The same prompts were used for all LLM models, and each prompt included a description of the extraction task, definitions for each phenotype, an example extraction format, and the entire clinical note. An initial prompt with these components was developed and evaluated against the Phase 1 clinical notes using GPT-4 before undergoing 3 subsequent refinements aimed at improving performance. These refinements addressed phenotype-level errors and reduced prompt complexity by pointing to locations within the text where phenotypes could be found and by extracting the 32 phenotypes through 3 queries instead of one. The model was instructed to return the justification for each extracted phenotype, allowing incorrect predictions to be reviewed more thoroughly. Table 2 describes the prompt content during each iteration, starting with the initial prompt, followed by the sectioning and pointing iteration, the partial pointing iteration, and the final split extractions iteration. Though preprocessing to split the note into standardized sections and subsections was investigated in early pipeline iterations, the entire unprocessed clinical note was provided to the model alongside the final split extractions prompts. The final system and guidelines prompts are found in Text S1.

Table 2.

Description of the iteratively updated prompts to improve LLM phenotype extraction.

Name | Description
Basic | All entities are listed and defined, with the inclusion of other terminologies each entity could be written as. This prompt used raw, unprocessed notes.
Sectioning and Pointing | Using the basic prompt as a starting point, the section location of each entity is included and the LLM is told to look for the entity within that section. The LLM was also asked to provide justification for each extracted entity. This prompt used the preprocessed notes, split into sections by headers.
Partial Pointing | Using the basic prompt as a starting point, section locations were included only for entities whose performance improved from basic to sectioning and pointing (scoliosis, T2 hyperintensities, heart murmur, hypertension, tone, anxiety, and macrocephaly). The LLM was also asked to provide justification for each extracted entity. This prompt used raw, unprocessed notes.
Split Extractions | The partial pointing prompt was split into 3 separate prompts, each extracting a different subset of the phenotypes: the first 11, the next 11, and the last 10. In each prompt, the LLM was also asked to provide justification for each extracted entity. This prompt used raw, unprocessed notes.
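The split extractions strategy amounts to chunking the phenotype list and querying once per chunk, as in the sketch below; `PHENOTYPES` and `extract_subset` are placeholders for the study’s phenotype list and prompt/LLM plumbing:

```python
# Chunk the 32 phenotypes into prompts of 11, 11, and 10 and query the
# model once per chunk; names below are placeholders.
def chunks(seq, sizes):
    i = 0
    for size in sizes:
        yield seq[i:i + size]
        i += size

def extract_subset(phenotypes, note):
    # Stand-in for one LLM call: build the subset prompt (definitions +
    # JSON template + full note) and parse the model's JSON reply.
    return {name: "no information" for name in phenotypes}

PHENOTYPES = ["phenotype_%d" % i for i in range(1, 33)]  # placeholder names
note = "...full clinical note text..."
results = {}
for subset in chunks(PHENOTYPES, [11, 11, 10]):          # 11 + 11 + 10 = 32
    results.update(extract_subset(subset, note))
```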

LLM extraction workflow

Prompt engineering was conducted through iterative evaluation and prompt refinement on the Phase 1 clinical notes, utilizing GPT-4 for extraction. Upon developing the final prompt, we extracted the phenotypes from the Phase 1 notes using GPT-4, Gemma3-27B, and DeepSeek-14B, obtaining our Phase 1 extraction results. Next, we used this final prompt to obtain the extraction results for all LLM models on the Phase 2 notes. Finally, we reviewed the performance of the pipelines on the new note set and slightly modified some entity definitions within the prompt to account for new physician terminology. The extraction process was repeated using the refined prompt for all 3 models to obtain the final Phase 2 performances.

Pipeline evaluation

The pipeline predictions were evaluated against manually curated gold-standard annotations. Precision, recall, weighted F1, and macro-averaged F1 scores were calculated to assess model performance. Statistical significance of differences between distributions of weighted F1-scores was assessed between the GPT-4 and rule-based pipelines and across phases using the Wilcoxon signed-rank test with the Bonferroni correction.44 All evaluations were performed in Python using scikit-learn and scipy.45,46
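For reference, the sketch below shows how these metrics and tests can be computed with the named libraries, using toy values; the Bonferroni factor of 7 matches the 7 comparisons described in Figure 4.

```python
# Weighted and macro F1 via scikit-learn, and a Bonferroni-corrected
# Wilcoxon signed-rank test over paired per-phenotype F1 scores.
# All values below are illustrative, not the study's data.
from sklearn.metrics import f1_score
from scipy.stats import wilcoxon

gold = ["positive", "negative", "unknown", "positive", "negative"]
pred = ["positive", "negative", "unknown", "unknown", "negative"]
print(f1_score(gold, pred, average="weighted"))
print(f1_score(gold, pred, average="macro"))

# Paired per-phenotype weighted F1 scores for two pipelines (toy values).
rule_f1 = [0.95, 0.89, 0.97, 0.92, 0.98, 0.86]
gpt_f1 = [0.93, 0.82, 0.98, 0.69, 0.90, 0.90]
stat, p = wilcoxon(rule_f1, gpt_f1)
p_adjusted = min(1.0, p * 7)  # Bonferroni over the 7 comparisons performed
print(p_adjusted)
```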

Gold-standard annotations

Two of our SMEs manually reviewed the 147 clinical notes selected for Phase 1, the validation set, and Phase 2, entering their annotations in a REDCap survey. Figure S1 shows a snippet of the codebook for the survey.47 For each note, the SMEs marked each phenotype as positive, negative, or unknown, except for the tone phenotype, which received annotations of high, normal, low, or unknown. A phenotype was marked as unknown if it was not mentioned in the note or if the note did not clearly indicate the patient’s status regarding the phenotype. We identified all disagreements between the SMEs and provided the list to the more senior SME, a senior neurologist specializing in NF1, who reviewed each disagreement and provided the final classification of positive, negative, or unknown. The presence of each phenotype within the notes varied greatly, with some phenotypes mentioned in >90% of notes and others in <10%. Table S1 displays the annotation frequency for each phenotype.

Inter-annotator agreement

Inter-annotator reliability was assessed after the first round of annotations using Cohen’s kappa. Agreement varied widely among phenotypes, with the lowest kappa scores occurring for phenotypes with uneven class distributions (eg, CALMs, heart murmur). Overall, we found moderate kappa agreement and high percent agreement across the Phase 1 and Phase 2 note sets: the interquartile ranges were 0.74 to 1 for kappa and 92% to 99% for percent agreement.
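A sketch of the per-phenotype agreement computation, with toy annotations:

```python
# Cohen's kappa and raw percent agreement between two annotators for
# one phenotype; labels below are illustrative.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "unknown", "unknown", "negative", "positive"]
annotator_b = ["positive", "unknown", "positive", "negative", "positive"]
kappa = cohen_kappa_score(annotator_a, annotator_b)
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print("kappa=%.2f, percent agreement=%.0f%%" % (kappa, agreement * 100))
```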

Results

Cohort characteristics

The pediatric NF1 cohort comprised 1085 individuals, 708 of whom had NF1 clinical progress notes available (Table 3). The 708 individuals with progress notes had a median age of 2.7 years at first encounter, with a race distribution of 80.1% White, 12.9% Black, 3.7% Asian, and 3.4% unknown/other race. A total of 147 notes from unique individuals were selected for annotation and analysis in Phase 1, the validation set, and Phase 2. The demographics of the individuals chosen for annotation were not significantly different from those of individuals with progress notes not chosen for annotation.

Table 3.

Demographics of study cohort.

 | Pediatric NF1 cohort | Pediatric NF1 cohort with progress notes | Phase 1, Validation, and Phase 2 datasets | P-value
Number of individuals | N = 1085 (100%) | N = 708 (100%) | N = 147 (13.5%) |
Sex | | | |
Female, N (%) | 534 (49.22%) | 339 (47.88%) | 75 (51.02%) | .39
Male, N (%) | 551 (50.78%) | 369 (52.12%) | 72 (48.98%) | .39
Age, median (IQR) | 4.4 (1.0-9.4) | 2.7 (0.6-6.8) | 2.2 (0.8-4.8) | .09
Race | | | |
Asian, N (%) | 29 (2.7%) | 26 (3.7%) | 3 (2.0%) | .53
Black, N (%) | 169 (15.6%) | 91 (12.9%) | 16 (10.9%) |
White, N (%) | 830 (76.5%) | 567 (80.1%) | 125 (85.0%) |
Other/unknown, N (%) | 57 (5.3%) | 24 (3.4%) | 4 (2.7%) |

P-values were calculated using the z-test for proportions for sex, the Mann-Whitney test for age, and the Fisher exact test for race. These P-values represent differences between pediatric NF1 patients whose progress notes were included in Phase 1, the validation set, or Phase 2 and those whose progress notes were not included.

Pipeline performance

The rule-based NLP and LLM pipelines were used to extract all 32 phenotypes from both the Phase 1 and Phase 2 note sets, and extraction results were evaluated against the gold-standard annotations. Table 4 compares the extraction performance of the rule-based and GPT-4 pipelines across both phases using weighted F1-scores. Two results are reported for the Phase 2 extraction: a baseline performance and the performance after pipeline refinement to account for new terminology. The bolded values represent the top-performing model for that phenotype and phase. Phase 2 scores for asthma and MPNST are not reported because these phenotypes received an annotation of “unknown” for all notes in the phase. Figure 4 contains 2 subplots displaying the distribution of F1-scores: subplot A shows the distribution of scores for all models and phases, while subplot B displays the statistical significance of differences in F1-score distributions between the rule-based and GPT pipelines. Table S2 displays the weighted F1-scores across all phases for the Gemma3-27B and DeepSeek-14B pipelines. Tables S3-S9 display macro-F1 scores and additional precision/recall data for all 4 pipelines.

Table 4.

Performance comparison across evaluation phases: weighted F1 scores.

Phenotype | Rule-based: Phase 1 F1 (N = 100) | Rule-based: Phase 2 F1 (N = 30) | Rule-based: Phase 2 F1, updated rules (N = 30) | GPT-4: Phase 1 F1 (N = 100) | GPT-4: Phase 2 F1 (N = 30) | GPT-4: Phase 2 F1, updated prompts (N = 30)
CALM Reference | 1.00 | 0.90 | 1.00 | 0.99 | 1.00 | 1.00
Skinfold Freckling | 0.98 | 0.84 | 1.00 | 0.96 | 0.97 | 0.97
Dermal Neurofibroma | 0.96 | 0.90 | 0.97 | 0.99 | 1.00 | 1.00
Headaches | 0.95 | 0.97 | 0.97 | 0.92 | 1.00 | 1.00
Heart Murmur | 0.97 | 0.97 | 1.00 | 0.97 | 0.50 | 0.86
Family History Any | 0.96 | 0.73 | 0.80 | 0.97 | 0.79 | 0.79
Plexiform Neurofibroma Any | 0.95 | 0.57 | 0.93 | 0.92 | 0.89 | 0.89
Hypertension | 0.92 | 0.87 | 0.93 | 0.69 | 0.84 | 0.84
Scoliosis | 0.89 | 0.34 | 0.95 | 0.82 | 1.00 | 0.98
Macrocephaly | 0.95 | 0.86 | 1.00 | 0.77 | 0.58 | 0.61
Dysplasia or Pseudarthrosis | 0.98 | 0.50 | 0.97 | 0.85 | 0.18 | 0.89
Lisch nodules | 0.97 | 0.93 | 0.93 | 0.98 | 0.97 | 1.00
Pes Planus | 0.98 | 0.90 | 1.00 | 0.96 | 0.97 | 0.97
School Assistance | 0.96 | 0.97 | 1.00 | 0.94 | 0.97 | 0.97
Numbness or Tingling | 0.98 | 0.92 | 1.00 | 0.91 | 1.00 | 1.00
T2 Hyperintensities | 0.86 | 0.90 | 0.98 | 0.90 | 0.95 | 0.95
Seizure/epilepsy | 0.98 | 1.00 | 1.00 | 0.78 | 0.93 | 0.89
OPG | 0.94 | 0.87 | 0.90 | 0.93 | 0.96 | 0.96
Fatigue | 0.91 | 0.98 | 1.00 | 0.75 | 0.82 | 0.75
Anxiety | 0.89 | 1.00 | 1.00 | 0.91 | 0.94 | 0.97
ADHD | 0.91 | 0.90 | 1.00 | 0.84 | 0.84 | 0.84
Hydrocephalus | 0.92 | 0.97 | 0.97 | 0.90 | 1.00 | 1.00
Eczema | 0.99 | 0.97 | 1.00 | 0.97 | 0.98 | 0.98
Asthma | 0.96 | – | – | 0.98 | – | –
NF1 microdeletion | 0.94 | 0.82 | 0.86 | 0.90 | 0.82 | 0.69
Tortuous optic nerve | 0.97 | 0.97 | 0.98 | 0.91 | 0.95 | 0.95
Precocious Puberty | 0.96 | 0.64 | 0.97 | 0.95 | 0.64 | 0.85
Autism | 0.91 | 0.96 | 1.00 | 0.94 | 0.81 | 0.81
Depression | 0.96 | 0.98 | 1.00 | 0.90 | 0.97 | 0.95
MPNST | 0.97 | – | – | 0.94 | – | –
Juvenile Xanthogranuloma | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Tone | 0.96 | 0.87 | 0.94 | 0.99 | 0.70 | 0.70

The bolded values represent the top-performing model for that phenotype and phase.

Figure 4.


(A) Distribution of phenotype F1-scores for Rule-based, GPT-4, Gemma3-27B, and DeepSeek-14B models across both phases. Box and whisker plots show median and interquartile range (IQR), with whiskers extending to 1.5*IQR past the 1st and 3rd quartiles. Malignant peripheral nerve sheath tumor (MPNST) and asthma phenotypes are excluded due to lack of Phase 2 F1-scores. (B) Distribution of weighted F1-scores for the GPT-4 and Rule-based models including statistical tests for F1-score distribution differences. MPNST and asthma are again excluded from analysis. Tests for difference in F1-score distributions were performed within each phase, and within each pipeline across adjacent phases for 7 total comparisons. Statistical significance was calculated using the Wilcoxon signed-rank test with the Bonferroni correction. Only significant differences are shown in the plot, with * indicating P ≤ .05, ** indicating P ≤ .01, and *** indicating P ≤ .001.

Pipeline refinement

A major component of our project was refining both pipelines after the transition to phase 2, mitigating the performance drops associated with switching authors. Example updates to the rule-based pipeline are listed below.

  1. We corrected an error within a preprocessing function, where relevant text was erroneously removed. This change led to slight improvements across many phenotypes, such as dermal neurofibroma (0.90 to 0.97) and pes planus (0.90 to 1.00).

  2. We added a rule that identified and marked the phrases “Tanner stage 1/2” or “Tanner stage i/ii” as negative occurrences of precocious puberty. This led to a substantial improvement (0.64 to 0.97).

  3. We added a rule that identified and marked the phrase “the spine was straight” as a negative occurrence of scoliosis. This led to the greatest improvement that we observed (0.34 to 0.95).

The LLM pipelines were altered through editing prompts to add or edit instructions. Example updates to the pipelines are listed below.

  1. We added an instruction for the LLMs to mark the phrases “Tanner stage 1/2” or “Tanner stage i/ii” as negative occurrences of precocious puberty. This led to a performance improvement for all models, with F1-score improvements of 0.21, 0.51, and 0.42 for GPT, Gemma3-27B, and DeepSeek-14B, respectively.

  2. We added an instruction for the LLMs to mark the phrases “leg lengths: equal” or “leg lengths are equal” as negative occurrences of dysplasia or pseudoarthrosis. This led to improvement in all models, with F1-score improvements of 0.71, 0.33, and 0.72 for GPT, Gemma3-27B, and DeepSeek-14B, respectively.

  3. We added an instruction for the LLMs to label heart murmur as unknown rather than negative when a heart murmur was not explicitly mentioned. This led to a performance boost in all models, with F1-score improvements of 0.36, 0.29, and 0.10 for GPT, Gemma3-27B, and DeepSeek-14B, respectively.

Comparison of rule-based vs LLM pipeline performance

Both the rule-based model and the LLMs extracted most phenotypes successfully; however, the rule-based pipeline obtained slightly higher F1-scores once all pipelines were refined. GPT outperformed the other LLMs, achieving F1-scores >0.85 on 26/32 phenotypes in Phase 1 and 22/32 in Phase 2 after refinement, compared to Gemma3-27B (21/32 in Phase 1 and 19/32 in Phase 2) and DeepSeek-14B (22/32 in Phase 1 and 20/32 in Phase 2). The rule-based model surpassed GPT with F1-scores >0.85 on 32/32 phenotypes in Phase 1 and 29/32 in Phase 2. Figure 4A further demonstrates the relative success of each method: the rule-based model has the highest distribution of weighted-average F1-scores in Phase 1 and in Phase 2 after refinement, followed by GPT, DeepSeek-14B, and Gemma3-27B. The refined rule-based model’s success is further demonstrated in Figure 4B, where its distribution of F1-scores was significantly greater than GPT’s in Phase 1 and in Phase 2 after refinement. Further illustrating this trend, the mean phenotype F1-score for the NLP model was 0.950 and 0.968 for Phases 1 and 2, respectively, compared to 0.907 and 0.902 for GPT, 0.848 and 0.853 for Gemma3-27B, and 0.875 and 0.874 for DeepSeek-14B.

To ensure that performance on the Phase 1 notes accurately represented pipeline efficacy, the models were tested on 17 validation notes from the same physician. Overall, the models performed similarly on the Phase 1 and validation notes, achieving median scores of 0.96 and 0.94, respectively, for the NLP model, 0.925 and 0.90 for GPT, 0.91 and 0.93 for Gemma3-27B, and 0.91 and 0.90 for DeepSeek-14B. Figure S3 plots the distribution of weighted F1-scores across the 100 Phase 1 notes and the validation notes, showing no significant difference between the distributions for any model. Finally, Tables S10 and S11 display the full phenotype-level weighted- and macro-F1 scores for the validation notes.

Although the rule-based model achieved slightly higher scores with the refined pipelines, the LLMs better maintained performance when transitioning between the Phase 1 and Phase 2 note sets. The rule-based pipeline experienced a statistically significant drop in F1-scores from Phase 1 to Phase 2, followed by a substantial rebound after the rules were refined (Figure 4B). No statistically significant drop and rebound was observed with the LLMs. This trend is also seen in the average F1-score, which decreased from Phase 1 to unmodified Phase 2 by 0.083 (8.8%) for the rule-based pipeline, compared to 0.04 (4.9%) for GPT, 0.037 (4.4%) for Gemma3-27B, and 0.044 (5.1%) for DeepSeek-14B. The same trends in performance and generalizability hold when using macro-F1 scores and limiting evaluation to phenotypes with more balanced class distributions; Figure S2 demonstrates this, displaying the ranges of macro-F1 scores across pipelines for phenotypes with >10% minority class prevalence.

Discussion

Understanding the factors underlying development of specific disease phenotypes is essential for future implementation of precision medicine approaches. While disease-specific clinical registries, like a prior one designed by our team,48 can be developed for this purpose, they are expensive and resource-intensive to maintain. Additionally, data quality is highly dependent on the source, and in the case of patient registries,49 requires independent medical validation.50 Thus, EHRs may serve as comprehensive, readily-available, and inexpensive alternatives; however, the lack of clinically-relevant phenotypic data in the structured EHR necessitates utilizing NLP to extract this information. This is especially important for rare clinical disorders, where much of the clinically-relevant information is contained within clinical notes. Herein, we developed multiple pipelines to extract phenotypes relevant to NF1, comparing a traditional rule-based NLP approach against various LLM approaches.

In our study, we developed all pipelines by reviewing clinical notes from a single physician, then applying these models to the clinical notes of a different physician from the same institution. This allowed us to evaluate the generalizability of each extraction method intra-institutionally.

Our study demonstrates that the rule-based extraction pipeline outperformed the LLM pipelines after considerable effort was spent on rule development. Recent studies have corroborated this finding, showing that highly developed rule-based models can outperform LLMs on a case-by-case basis.21,51 Rule-based approaches provide additional benefits that should be considered when building an extraction pipeline, including ease of interpretability, extraction replicability,34–36 and established use with PHI.18,19

Despite the success of the rule-based model, it may not be practical for many clinical extraction problems due to the time-cost associated with developing rules. Rule development for NF1 is particularly labor-intensive due to the lack of a standardized NF1 lexicon within the field. While there is consensus regarding the diagnostic criteria for NF1,52 the language used clinically to describe the criteria and associated clinical features of NF1 remains highly variable across providers and institutions. For example, OPG is a core diagnostic feature of NF1, yet it is referred to both clinically and within the scientific literature by various other names, including optic nerve glioma, optic nerve tumor, visual pathway glioma, and chiasmatic glioma. Given this heterogeneity, the rule-based NLP pipeline was developed over months and required significant input from physicians familiar with the clinical note structure and variable terminology used. Rule-based models are often designed to match the specific documentation practices of a single author, leading to concerns about generalizability. In our pipeline, we addressed these concerns by developing a straightforward CSV-based system for rule customization. However, there remains a significant time-cost associated with refining the rules.

Conversely, the LLM-based pipelines were developed significantly faster and were readily refined through rapid prompt adjustments. The same prompts and pipeline structure could be utilized across GPT-4, Gemma3-27B, and DeepSeek-14B, exemplifying the transferability of the LLM extraction pipeline across models. As new LLMs are developed, these improved models could be used for extraction with minimal changes in pipeline structure, suggesting that the LLM extraction pipelines are highly adaptable to evolving technologies.

In addition to different time-costs for pipeline development, each pipeline carried different monetary and resource costs that could influence the choice of extraction method. The rule-based pipeline was free to run and required no GPU, representing a low financial and resource burden. The local Gemma3-27B and DeepSeek-14B pipelines required a GPU, an up-front investment of thousands of dollars; however, once the GPU is purchased, there are no query-level costs, making local language models a sustainable choice. Finally, the GPT pipeline required an endpoint with a per-query fee. This approach can yield high performance but incurs expenses that scale with the size of the extraction, potentially amounting to hundreds of dollars for a single extraction project on a large note set.30

We observed that GPT outperformed the local LLMs, showing the improved extraction performance that larger models can provide. While none of the LLMs exceeded the performance of our rule-based model, our results suggest that they conferred higher generalizability. This is particularly compelling given the significant variability that can exist between providers with respect to phraseology, clinical jargon, and abbreviations used in NF1 clinical notes. However, these results are not unexpected given the diverse corpora of data that LLMs are trained on and previously noted generalizability issues associated with rule-based models.53 The large variation in EHR documentation practice among institutions54 suggests that LLM-based clinical extraction models may be better suited for large-scale multi-institutional entity extraction problems. If LLMs are to achieve widespread cross-institutional implementation, privacy issues must be overcome. To utilize closed-source OpenAI models, HIPAA secure GPT endpoints are required if the model will interact with PHI, and the use of these endpoints is not currently widespread. Local models can be run using institutional infrastructure with fewer patient privacy concerns, but large-scale extraction with local models requires significant GPU resources. As secure GPT endpoints become more common and as more institutions obtain infrastructure to run local LLMs, the feasibility of using LLMs for large-scale clinical entity extraction problems may increase.

Study limitations and future work

Though our research provides valuable insights into the efficacy and preferential use cases for rule-based and LLM-based entity extraction pipelines, we acknowledge limitations of our study. Due to the nature and symptomology of NF1, many phenotypes we selected were rare or almost universally present. This led to an imbalance of positive/negative/unknown classes for some phenotypes in our annotations, summarized in Table S1. Despite this, we included these phenotypes in the analysis because this reflects the real-world application of extraction models to rare disease. Accounting for the imbalance, we have provided analysis with both weighted (Table 4) and macro-averaged F1 scores (Table S3).

Additionally, the significant effort of obtaining high-quality gold-standard annotations made a sample size greater than the 147 notes we obtained unfeasible. Most annotations were used for both analysis and pipeline development; as such, the Phase 2 pipeline refinements were not evaluated against independent data. However, we believe the notes used reflect the documentation patterns of the broader progress note set, and that concurrent analysis and iterative improvement of the pipelines was the best choice given annotation scarcity. To address the need for analysis on notes not used for pipeline refinement, we tested our Phase 1 pipelines on a validation set of 17 notes, finding that performance was comparable between the Phase 1 notes used for refinement and the validation set.

Other limitations include the lack of a note set from an external institution. Applying our pipelines to external data would have been a stronger test of generalizability, and we plan to address this in future studies. Additionally, our study did not include a reliability analysis evaluating the stability of the LLM pipelines’ extractions, which was not conducted due to cost constraints. We took steps to reduce the inherent variability of LLMs, such as setting the temperature parameter to 0 for GPT and 0.1 for the local models.

Our plans for future work include applying, further developing, and analyzing these pipelines at a second institution to evaluate generalizability and assess the potential for broad multi-institutional use. Finally, we plan to incorporate the phenotype extraction results in ML models for NF1 risk stratification to determine if extracted phenotypes capture predictive meaning absent in the structured EHR.

Conclusion

Automated artificial intelligence methods like NLP were effectively utilized to develop a clinical entity extraction pipeline for identifying NF1-related phenotypes in clinical progress notes. Both rule-based and LLM-based phenotype extraction pipelines successfully extracted the selected NF1-related phenotypes from clinical progress notes. The customized rule-based model achieved marginally higher performance but required significantly more input and time from clinical experts than the LLMs. The LLMs conferred better generalizability across note authors, and the LLM pipeline was more straightforward and quicker to develop. Overall, we demonstrate the continued utility of rule-based clinical entity extraction pipelines, while also providing evidence of the potential for high-efficacy LLM entity extraction models to break down institutional barriers.

Supplementary Material

ocaf155_Supplementary_Data

Contributor Information

Levi Kaster, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Ethan Hillis, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Inez Y Oh, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Elizabeth C Cordell, Department of Neurology, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Randi E Foraker, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Albert M Lai, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Stephanie M Morris, Center for Autism Services, Science, and Innovation (CASSI), Kennedy Krieger Institute, Baltimore, MD 21211, United States.

David H Gutmann, Department of Neurology, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Philip R O Payne, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Aditi Gupta, Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.

Author contributions

Levi Kaster (Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Ethan Hillis (Conceptualization, Investigation, Methodology, Software, Validation, Writing—original draft, Writing—review & editing), Inez Oh (Conceptualization, Data curation, Methodology, Project administration, Visualization, Writing—original draft, Writing—review & editing), Elizabeth C. Cordell (Data curation, Writing—review & editing), Randi E. Foraker (Conceptualization, Writing—review & editing), Albert Max Lai (Conceptualization, Writing—review & editing), Stephanie M. Morris (Conceptualization, Data curation, Methodology, Supervision, Validation, Writing—review & editing), David Gutmann (Conceptualization, Methodology, Supervision, Validation, Writing—review & editing), Philip Richard Orrin Payne (Conceptualization, Funding acquisition, Supervision, Writing—review & editing), and Aditi Gupta (Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Writing—original draft, Writing—review & editing)

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under grant number R01NS131112.

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Data availability

The patient level data underlying this article cannot be shared publicly in order to protect patient privacy. The data will be shared on reasonable request to the corresponding author.

References

  • 1. Huson SM, Compston DA, Clark P, Harper PS. A genetic study of von Recklinghausen neurofibromatosis in south east Wales. I. Prevalence, fitness, mutation rate, and effect of parental transmission on severity. J Med Genet. 1989;26:704-711. 10.1136/jmg.26.11.704.
  • 2. Friedman JM. Neurofibromatosis 1. University of Washington; 1993.
  • 3. Gutmann DH, Ferner RE, Listernick RH, Korf BR, Wolters PL, Johnson KJ. Neurofibromatosis type 1. Nat Rev Dis Primers. 2017;3:17004. 10.1038/nrdp.2017.4.
  • 4. Gutmann DH, Aylsworth A, Carey JC, et al. The diagnostic evaluation and multidisciplinary management of neurofibromatosis 1 and neurofibromatosis 2. JAMA. 1997;278:51-57. 10.1001/jama.1997.03550010065042.
  • 5. Diggs-Andrews KA, Brown JA, Gianino SM, Rubin JB, Wozniak DF, Gutmann DH. Sex is a major determinant of neuronal dysfunction in neurofibromatosis type 1. Ann Neurol. 2014;75:309-316. 10.1002/ana.24093.
  • 6. Morris SM, Acosta MT, Garg S, et al. Disease burden and symptom structure of autism in neurofibromatosis type 1: a study of the international NF1-ASD consortium team (INFACT). JAMA Psychiatry. 2016;73:1276-1284. 10.1001/jamapsychiatry.2016.2600.
  • 7. Chisholm AK, Lami F, Haebich KM, et al. Sex- and age-related differences in autistic behaviours in children with neurofibromatosis type 1. J Autism Dev Disord. 2023;53:2835-2850. 10.1007/s10803-022-05571-6.
  • 8. Koczkowska M, Callens T, Chen Y, et al. Clinical spectrum of individuals with pathogenic NF1 missense variants affecting p.Met1149, p.Arg1276, and p.Lys1423: genotype–phenotype study in neurofibromatosis type 1. Hum Mutat. 2020;41:299-315. 10.1002/humu.23929.
  • 9. Pasmant E, Sabbagh A, Spurlock G, et al. NF1 microdeletions in neurofibromatosis type 1: from genotype to phenotype. Hum Mutat. 2010;31:E1506-E1518. 10.1002/humu.21271.
  • 10. Kang E, Kim Y-M, Seo GH, et al. Phenotype categorization of neurofibromatosis type I and correlation to NF1 mutation types. J Hum Genet. 2020;65:79-89. 10.1038/s10038-019-0695-0.
  • 11. Rojnueangnit K, Xie J, Gomes A, et al. High incidence of Noonan syndrome features including short stature and pulmonic stenosis in patients carrying NF1 missense mutations affecting p.Arg1809: genotype–phenotype correlation. Hum Mutat. 2015;36:1052-1063. 10.1002/humu.22832.
  • 12. Morris SM, Gupta A, Kim S, Foraker RE, Gutmann DH, Payne PRO. Predictive modeling for clinical features associated with neurofibromatosis type 1. Neurol Clin Pract. 2021;11:e497-e505. 10.1212/CPJ.0000000000001089.
  • 13. Gliklich RE, Dreyer NA, Leavy MB, eds. Registries for Evaluating Patient Outcomes: A User’s Guide. Vol. 2, 3rd ed. Publication No. 13(14)-EHC111. AHRQ; 2014.
  • 14. Hageman IC, van Rooij IALM, de Blaauw I, Trajanovska M, King SK. A systematic overview of rare disease patient registries: challenges in design, quality management, and maintenance. Orphanet J Rare Dis. 2023;18:106. 10.1186/s13023-023-02719-0.
  • 15. Wei W-Q, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J Am Med Inform Assoc. 2016;23:e20-e27. 10.1093/jamia/ocv130.
  • 16. Song J, Hobensack M, Bowles KH, et al. Clinical notes: an untapped opportunity for improving risk prediction for hospitalization and emergency department visit during home health care. J Biomed Inform. 2022;128:104039. 10.1016/j.jbi.2022.104039.
  • 17. Ye J, Yao L, Shen J, Janarthanam R, Luo Y. Predicting mortality in critically ill patients with diabetes using machine learning and clinical notes. BMC Med Inform Decis Mak. 2020;20:295. 10.1186/s12911-020-01318-4.
  • 18. Afzal N, Sohn S, Abram S, et al. Mining peripheral arterial disease cases from narrative clinical notes using natural language processing. J Vasc Surg. 2017;65:1753-1761. 10.1016/j.jvs.2016.11.031.
  • 19. Oh IY, Schindler SE, Ghoshal N, Lai AM, Payne PRO, Gupta A. Extraction of clinical phenotypes for Alzheimer’s disease dementia from clinical notes using natural language processing. JAMIA Open. 2023;6:ooad014. 10.1093/jamiaopen/ooad014.
  • 20. Cui L, Sahoo SS, Lhatoo SD, et al. Complex epilepsy phenotype extraction from narrative clinical discharge summaries. J Biomed Inform. 2014;51:272-279. 10.1016/j.jbi.2014.06.006.
  • 21. Sivarajkumar S, Gao F, Denny P, et al. Mining clinical notes for physical rehabilitation exercise information: natural language processing algorithm development and validation study. JMIR Med Inform. 2024;12:e52289. 10.2196/52289.
  • 22. Bhattarai K, Oh IY, Sierra JM, et al. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spaCy’s rule-based and machine learning-based methods. JAMIA Open. 2024;7:ooae060. 10.1093/jamiaopen/ooae060.
  • 23. Zhao WX, Zhou K, Li J, et al. A survey of large language models. arXiv preprint arXiv:2303.18223. 2023, preprint: not peer reviewed. 10.48550/arXiv.2303.18223.
  • 24. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. arXiv preprint arXiv:2205.12689. 2022, preprint: not peer reviewed.
  • 25. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023, preprint: not peer reviewed.
  • 26. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30:1134-1142. 10.1038/s41591-024-02855-5.
  • 27. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375. 2023, preprint: not peer reviewed.
  • 28. Liu S, McCoy AB, Wright AP, et al. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc. 2024;31:1388-1396. 10.1093/jamia/ocae041.
  • 29. Ferber D, Wiest IC, Wölflein G, et al. GPT-4 for information retrieval and comparison of medical oncology guidelines. NEJM AI. 2024;1:AIcs2300235. 10.1056/AIcs2300235.
  • 30. Kaster L, Hillis E, Oh IY, Brain Gene Registry Consortium, et al. Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models. J Neurodev Disord. 2025;17:24. 10.1186/s11689-025-09612-w.
  • 31. Gemma Team, Kamath A, Ferret J, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786. 2025, preprint: not peer reviewed.
  • 32. Guo D, Yang D, Zhang H, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025, preprint: not peer reviewed.
  • 33. Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14-29. 10.1016/j.jbi.2017.07.012.
  • 34. Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34-49. 10.1016/j.jbi.2017.11.011.
  • 35. Shivade C, Raghavan P, Fosler-Lussier E, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21:221-230. 10.1136/amiajnl-2013-001935.
  • 36. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform. 2019;7:e12239. 10.2196/12239.
  • 37. Eyre H, Chapman AB, Peterson KS, et al. Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. AMIA Annu Symp Proc. 2021;2021:438-447.
  • 38. McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference; 2010. 10.25080/Majora-92bf1922-00a.
  • 39. Harris CR, Millman KJ, van der Walt SJ, et al. Array programming with NumPy. Nature. 2020;585:357-362. 10.1038/s41586-020-2649-2.
  • 40. Shi J, Mowery DL, Doing-Harris KM, Hurdle JF. RuSH: a rule-based segmentation tool using hashing for extremely accurate sentence segmentation of clinical text. In: American Medical Informatics Association Annual Symposium; 2016.
  • 41. Flynn JT, Kaelber DC, Baker-Smith CM, et al. Clinical practice guideline for screening and management of high blood pressure in children and adolescents. Pediatrics. 2017;140:e20171904. 10.1542/peds.2017-1904.
  • 42. Ollama. Ollama: a lightweight, extensible framework for building and running language models. GitHub; 2024. Accessed April 22, 2025. https://github.com/ollama/ollama.
  • 43. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172-180. 10.1038/s41586-023-06291-2.
  • 44. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc. 1961;56:52-64. 10.1080/01621459.1961.10482090.
  • 45. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825-2830.
  • 46. Virtanen P, Gommers R, Oliphant TE, SciPy 1.0 Contributors, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261-272. 10.1038/s41592-019-0686-2.
  • 47. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377-381. 10.1016/j.jbi.2008.08.010.
  • 48. Johnson KJ, Hussain I, Williams K, Santens R, Mueller NL, Gutmann DH. Development of an international internet-based neurofibromatosis type 1 patient registry. Contemp Clin Trials. 2013;34:305-311. 10.1016/j.cct.2012.12.002.
  • 49. Johnson KJ, Mueller NL, Williams K, Gutmann DH. Evaluation of participant recruitment methods to a rare disease online registry. Am J Med Genet A. 2014;164A:1686-1694. 10.1002/ajmg.a.36530.
  • 50. Sharkey EK, Zoellner NL, Abadin S, Gutmann DH, Johnson KJ. Validity of participant-reported diagnoses in an online patient registry: a report from the NF1 patient registry initiative. Contemp Clin Trials. 2015;40:212-217. 10.1016/j.cct.2014.12.006.
  • 51. Zhang J, Sun K, Jagadeesh A, et al. The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant. J Am Med Inform Assoc. 2024;31:1891. 10.1093/jamia/ocae184.
  • 52. Legius E, Messiaen L, Wolkenstein P, International Consensus Group on Neurofibromatosis Diagnostic Criteria (I-NF-DC), et al. Revised diagnostic criteria for neurofibromatosis type 1 and Legius syndrome: an international consensus recommendation. Genet Med. 2021;23:1506-1513. 10.1038/s41436-021-01170-5.
  • 53. Carrell DS, Schoen RE, Leffler DA, et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. J Am Med Inform Assoc. 2017;24:986-991. 10.1093/jamia/ocx039.
  • 54. Cohen GR, Friedman CP, Ryan AM, Richardson CR, Adler-Milstein J. Variation in physicians’ electronic health record documentation and potential patient harm from that variation. J Gen Intern Med. 2019;34:2355-2367. 10.1007/s11606-019-05025-3.
