Abstract
Objectives
All clinical trials face a significant bottleneck in identifying eligible participants, particularly due to the complexity of unstructured medical texts. Recent advances in natural language processing, especially the advent of transformer-based models, have shown promise in this domain. In this study, we evaluated the performance of a prompt-based large language model (LLM) for cohort selection from unstructured medical notes.
Methods
Medical records were annotated with Med-CAT using the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) ontology. For each trial eligibility criterion, we extracted sentences containing relevant annotated concepts through an ontology-driven summarization process. These summaries were then input into a prompt-based LLM (GPT-3.5-turbo), tasked with classifying eligibility criteria in a zero-shot setting. Model performance was assessed using the 2018 National Natural Language Processing Clinical Challenges (n2c2) dataset, which required the classification of 288 patients’ medical records according to 13 eligibility criteria.
Results
The proposed prompt-based model achieved overall micro and macro F-measures of 0.9061 and 0.8060, respectively— among the highest scores reported for this dataset.
Conclusions
Our results demonstrate that integrating ontology-based extractive summarization with prompt-based LLMs can substantially improve eligibility classification. The summarization step enhanced model focus and interpretability, particularly for long or ambiguous narratives. This pipeline offers a scalable and adaptable framework for clinical trial automation and has the potential for real-world integration with electronic medical record matching systems.
Keywords: Patient Recruitment, Natural Language Processing, SNOMED CT Ontology, Large Language Models, Electronic Medical Records
I. Introduction
Recruiting a sufficient number of participants who meet specified eligibility criteria is essential for every clinical trial. The manual review of clinical data and identification of eligible cases often constitute the most labor-intensive aspect of the trial process [1].
In recent years, the development of electronic health records (EHR) has partially streamlined this process. However, the reliance on unstructured and ambiguous text in EHRs continues to pose significant challenges. To address this issue, researchers have increasingly adopted natural language processing (NLP) techniques [2].
The field of NLP is currently undergoing a revolution driven by rapid advances in LLMs. The application of LLMs to the medical domain has attracted considerable interest [3–5], with use cases ranging from clinical documentation [6,7] and decision support [8,9], to knowledge-based information retrieval and generation [10,11], medical research [12–15], and data processing and analysis [16,17]. However, the use of LLMs for clinical trial matching remains in its early stages [18].
The primary objective of this study is to investigate the effectiveness of prompt-based learning models for cohort selection in clinical trials, utilizing unstructured data from EHRs. We aimed to determine whether prompt-based methods can achieve results comparable to, or better than, those obtained using traditional NLP techniques. Our focus was not limited to benchmarking model performance, but also to assessing whether structured summarization could enhance eligibility classification.
Our focus is on a specific challenge—cohort identification in the 2018 National NLP Clinical Challenges (n2c2)—which provided a dataset of free-text medical records from 288 patients.
II. Methods
1. Dataset
We utilized the dataset from the 2018 n2c2 shared task (track 1), which centered on cohort selection for clinical trials. This dataset includes 288 patient records, each manually labeled by experts to indicate whether the patient met any of 13 eligibility criteria. These criteria cover MAJOR-DIABETES (major diabetes complications), DRUG-ABUSE (history of drug abuse), ABDOMINAL (history of intra-abdominal surgery, intestinal resection, or small bowel obstruction), MI-6MOS (myocardial infarction within the last 6 months), KETO-1YR (diagnosis of ketoacidosis within the past year), ALCOHOL-ABUSE (current alcohol use above recommended limits), MAKES-DECISIONS (ability to make medical decisions independently), ENGLISH (ability to speak English), ASP-FOR-MI (aspirin use for myocardial infarction prevention), ADVANCED-CAD (advanced cardiovascular disease), DIETSUPP-2MOS (dietary supplement use in past 2 months), HBA1C (HbA1c between 6.5% and 9.5%), and CREATININE (serum creatinine above normal). Some criteria, such as MI-6MOS, KETO-1YR, ALCOHOL-ABUSE, and DRUG-ABUSE, have very limited instances in the data, making accurate identification particularly challenging. The dataset was divided into a training set (202 records) and a test set (86 records), with significant class imbalance present.
2. Methodology
Figure 1 provides an overview of the proposed framework’s main stages. Each step is detailed in the following sections.
Figure 1.
Summary of the proposed framework for the application of prompt-based learning for cohort selection.
1) Knowledge graph
Traditionally, ontologies in NLP are used to standardize terminology and features. For this study, we leveraged Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT).
Unlike previous studies that manually curated trigger words for text processing, our method utilizes established ontologies to enable a semi-automatic extractive summarization and relevant sentence extraction process—specifically targeting sentences containing SNOMED CT concepts aligned with each criterion. This involves leveraging concepts encoded in SNOMED CT within the Unified Medical Language System (UMLS) to extract pertinent information from eligibility criteria and patient records, akin to the method described in [19] for evaluating the similarity between criteria and patients. UMLS is the largest comprehensive repository of biomedical concepts and their relationships [20]. For example, the eligibility criterion “ABDOMINAL: History of intra-abdominal surgery, small or large intestine resection, or small bowel obstruction” is translated into SNOMED CT keywords such as “major abdominal surgery,” “small intestine excision,” “large intestine excision,” and “small bowel obstruction.” All descendants of these terms are retrieved automatically with their concept unique identifier (CUI) codes.
To support this process, we employed PyMedTermino2 [21], a Python library built on Owlready2, which facilitates programmatic access to the SNOMED CT ontology via the UMLS 2021AA release. We used the “is-a” relationship to automatically expand broad clinical concepts (e.g., “major abdominal surgery”) into comprehensive lists of descendant terms. For instance, “colectomy” (removal of large bowel) is a descendant of “major abdominal surgery” and was thus included. Retrieved concepts included each term’s label, synonyms, and child terms. We constructed a code list for each criterion comprising the primary SNOMED CT-related keywords; for example, “161617006, major abdominal surgery [Finding]” along with its associated descendants (e.g., colectomy).
2) Extractive summarization
Extractive summarization entails selecting essential sentences from the source text to create a concise summary that respects the input length constraints of the LLM, while preserving the original meaning [22].
In this study, we applied minimal pre-processing to best evaluate the capabilities of LLMs in handling medical narratives. The main pre-processing steps included time normalization and the expansion of certain acronyms and abbreviations, using regular expressions and rule-based approaches.
At this stage, free-text clinical records were automatically annotated by MedCAT [23], an NLP tool extending sciSpa-Cy to recognize named entities linked to clinical concepts. These entities were mapped to SNOMED CT and UMLS terminologies. We used MedCAT v1.0.0rc2, publicly available and pre-trained on the MIMIC-III database and the entirety of SNOMED CT. Patients’ free-text records were annotated by MedCAT, and the extracted concepts were matched against the SNOMED CT descendant code lists from the previous step.
Finally, we extracted temporal information and sentences from the records containing annotated concepts whose CUI codes matched those for each eligibility criterion.
3) Prompt engineering
A prompt is an instruction or question provided to the model to guide its response [24]. In this step, each eligibility criterion was converted into a prompt and presented to the GPT-3.5-turbo model, together with the summarized text from the previous stage. Each criterion was reworded into a question or prompt; Table 1 shows examples for all 13 criteria. For binary classification, our approach required the model to respond with a single word: “yes” or “no.” We mapped “yes” to the class “met” and “no” to “not met.” The model was instructed to determine whether each summarized record met the specified criterion. Evaluation was conducted using all 86 records in the test set. Our approach used a fully zero-shot setting, with no additional fine-tuning on the n2c2 dataset. An illustrative example of zero-shot learning for the “ABDOMINAL” criterion is shown in Figure 2.
Table 1.
13 included eligibility criteria, their definitions, and the question-based prompts created in this study
| Cohort selection criteria | Definition provided by the challenge | Question-based prompt |
|---|---|---|
| ABDOMINAL | History of intra-abdominal surgery, small or large intestine resection, or small bowel obstruction | Does the patient in the following text have a history of abdominal surgery? Answer with one word: yes or no. |
| ADVANCED-CAD | Advanced cardiovascular disease (CAD). For this annotation, we define “advanced” as having at least 2 of the following: Taking 2 or more medications to treat CAD, history of myocardial infarction (MI), currently experiencing angina, and ischemia, past or present. | Does the patient described in the following text meet at least two of the following four criteria: taking two or more medications to treat coronary artery disease (CAD), history of myocardial infarction (MI), currently experiencing angina, or ischemia? |
| ALCOHOL-ABUSE | Current alcohol use over weekly recommended limits | Is the patient in the following text currently consuming alcohol above the recommended weekly limits? |
| ASP-FOR-MI | Use of aspirin to prevent a myocardial infarction (MI) | Is the patient in the following text using aspirin specifically to prevent myocardial infarction (MI), and not for any other condition? |
| CREATININE | Serum creatinine > upper limit of normal | Does the patient in the following text have a creatinine (Cr) level greater than 1.4? |
| DIETSUPP-2MOS | Taken a dietary supplement (excluding vitamin D) in the past 2 months | Please review the patient’s record and determine whether any dietary supplements are mentioned in the record dated (current time) or within the preceding two months. If so, answer yes; otherwise, answer no. Consider each record date noted at the start of each paragraph as YYYY-MM-DD. Dietary supplements include folic acid, multivitamins, vitamins (excluding vitamin D), calcium, magnesium, iron, echinacea and ginger, caffeine and curcumin, tryptophan and glutamine, probiotics, glucosamine, and fish oils. |
| DRUG-ABUSE | Drug abuse, current or past | Is the patient in the following text experiencing drug abuse? Answer with one word: yes or no. |
| ENGLISH | The patient must speak | English Does the patient in the following text speak a language other than English? Answer with one word: yes or no. |
| HBA1C | Any hemoglobin A1c (HbA1c) value between 6.5% and 9.5% | Has the patient ever had a hemoglobin value between 6.5 and 9.5? If at least one value in this range is mentioned, answer ‘yes’; otherwise, answer ‘no’. |
| KETO-1YR | Diagnosis of ketoacidosis in the past year | Has the patient in the following text been diagnosed with ketoacidosis within the past year? Use the record date labeled (current time) as the reference and consider each record date at the beginning of each paragraph as YYYY-MM-DD. |
| MAJOR-DIABETES | Major diabetes-related complications. For this annotation, we define “major complications” (as opposed to “minor complications”) as any of the following that are a result of (or strongly correlated with) uncontrolled diabetes: amputation, kidney damage, skin conditions, retinopathy, nephropathy, and neuropathy | Does the patient in the following text have a history of amputation or any disease affecting the kidney, skin, retinopathy, nephropathy, or neuropathy? If at least one such condition is mentioned, answer ‘yes’; otherwise, answer ‘no’. |
| MAKES-DECISIONS | Patients must make their own medical decisions | Does the patient in the following text have cognitive limitations that prevent them from making their own medical decisions? Answer with yes or no. |
| MI-6MOS | MI in the past 6 months | Does the patient in the following text have any mention of myocardial infarction in the record dated (current time) or within the preceding six months? Use the record date at the beginning of each paragraph (YYYY-MM-DD) as the paragraph’s correct date (ignore other dates mentioned within the text). |
Figure 2.
Prompt-based approach for one of the criteria.
4) Evaluation metrics
We followed the evaluation protocol established by the challenge organizers. GPT output was submitted to a predefined evaluation script, which computed precision, recall, F score, specificity, and area under the receiver operating characteristic curve (AUC). In our experiment, precision reflects the proportion of records labeled as “met” that were truly “met.” Recall measures the proportion of actual “met” records correctly identified by our model. Specificity quantifies the proportion of records labeled as “not met” that were correctly classified. The F score, combining both precision and recall, was calculated in both micro and macro variants. Micro F1 aggregates true positives, false positives, and false negatives across all classes before computing the score, whereas macro F1 calculates the F1 for each class independently and averages them, giving equal weight to rare and common classes. In imbalanced datasets, the micro F1 score often exceeds macro F1 because it reflects better performance on dominant classes [25].
III. Results
1. Model
For our experiments, we used OpenAI’s GPT-3.5-turbo model, recognized as the most capable and cost-effective variant of GPT-3.5 at the time of the study [26]. To ensure reproducibility across multiple runs, we set the temperature parameter to 0 for all requests. While a temperature of 0 substantially increases output determinism, a small degree of variability can still occur due to the inherent stochasticity of transformer-based models. The model was instructed to respond with a single word (“yes” or “no”). Although the temperature setting standardized outputs, slight variability may have persisted.
2. Results
To illustrate the effect of the summarization method, we present results for two scenarios. In the first scenario, no summarization was performed; the clinical free-text and prompts were given directly to the model. Due to the model’s text processing limitation of approximately 4,096 tokens, we truncated longer medical narratives and excluded the conclusions from those records. It should be noted that about 11% of the records exceeded 4,000 words. Table 2 provides a summary of the findings from this initial scenario.
Table 2.
Summary of the evaluation metrics overall and for each criterion, presented separately and without summarization
| Met | Not met | Overall | |||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|||||||
| Precision | Recall | Specificity | F (b=1) | Precision | Recall | F (b=1) | F (b=1) | AUC | |
| ABDOMINAL | 0.5455 | 0.2000 | 0.9107 | 0.2927 | 0.6800 | 0.9107 | 0.7786 | 0.5357 | 0.5554 |
|
| |||||||||
| ADVANCED-CAD | 0.8148 | 0.4889 | 0.8780 | 0.6111 | 0.6102 | 0.8780 | 0.7200 | 0.6656 | 0.6835 |
|
| |||||||||
| ALCOHOL-ABUSE | 0 | 0 | 0.9880 | 0 | 0.9647 | 0.9880 | 0.9762 | 0.4881 | 0.4940 |
|
| |||||||||
| ASP-FOR-MI | 1 | 0.0735 | 1 | 0.1370 | 0.2222 | 1 | 0.3636 | 0.2503 | 0.5368 |
|
| |||||||||
| CREATININE | 0.8571 | 0.2500 | 0.9839 | 0.3871 | 0.7722 | 0.9839 | 0.8652 | 0.6262 | 0.6169 |
|
| |||||||||
| DIETSUPP-2MOS | 1 | 0.0455 | 1 | 0.0870 | 0.5000 | 1 | 0.6667 | 0.3768 | 0.5227 |
|
| |||||||||
| DRUG-ABUSE | 1 | 0.3333 | 1 | 0.5000 | 0.9765 | 1 | 0.9881 | 0.7440 | 0.6667 |
|
| |||||||||
| ENGLISH | 0.8488 | 1 | 0 | 0.9182 | 0 | 0 | 0 | 0.4591 | 0.5000 |
|
| |||||||||
| HBA1C | 0 | 0 | 1 | 0 | 0.5930 | 1 | 0.7445 | 0.3723 | 0.5000 |
|
| |||||||||
| KETO-1YR | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0.5000 | 0.5000 |
|
| |||||||||
| MAJOR-DIABETES | 1 | 0.3488 | 1 | 0.5172 | 0.6056 | 1 | 0.7544 | 0.6358 | 0.6744 |
|
| |||||||||
| MAKES-DECISIONS | 0.9765 | 1 | 0.3333 | 0.9881 | 1 | 0.3333 | 0.5000 | 0.7440 | 0.6667 |
|
| |||||||||
| MI-6MOS | 0 | 0 | 0.9487 | 0 | 0.9024 | 0.9487 | 0.9250 | 0.4625 | 0.4744 |
|
| |||||||||
| Overall (micro) | 0.8730 | 0.4641 | 0.9530 | 0.6060 | 0.7185 | 0.9530 | 0.8193 | 0.7126 | 0.7085 |
|
| |||||||||
| Overall (macro) | 0.6187 | 0.2877 | 0.8494 | 0.3414 | 0.6790 | 0.8494 | 0.7140 | 0.5277 | 0.5686 |
AUC: area under the receiver operating characteristic curve.
See Table 1 for abbreviations for criteria.
After applying our proposed summarization method in the second scenario, we observed a marked improvement in the model’s performance. Table 3 summarizes the results using summarization: our model achieved micro and macro F-measures of 0.9035 and 0.7949, respectively. The criteria ALCOHOL-ABUSE, CREATININE, ENGLISH, and ABDOMINAL achieved the highest F-measures (above 0.9), while KETO-1YR, MI-6MOS, and MAKES-DECISIONS had the lowest F-scores (below 0.7).
Table 3.
Summary of the evaluation metrics overall and for each criterion, separately using the SNOMED CT-based summarization method
| Met | Not met | Overall | |||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|||||||
| Precision | Recall | Specificity | F (b=1) | Precision | Recall | F (b=1) | F (b=1) | AUC | |
| ABDOMINAL | 0.9000 | 0.9000 | 0.9464 | 0.9000 | 0.9464 | 0.9464 | 0.9464 | 0.9232 | 0.9232 |
|
| |||||||||
| ADVANCED-CAD | 0.7843 | 0.8889 | 0.7317 | 0.8333 | 0.8571 | 0.7317 | 0.7895 | 0.8114 | 0.8103 |
|
| |||||||||
| ALCOHOL-ABUSE | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|
| |||||||||
| ASP-FOR-MI | 0.9041 | 0.9706 | 0.6111 | 0.9362 | 0.8462 | 0.6111 | 0.7097 | 0.8229 | 0.7908 |
|
| |||||||||
| CREATININE | 1 | 0.8750 | 1 | 0.9333 | 0.9538 | 1 | 0.9764 | 0.9549 | 0.9375 |
|
| |||||||||
| DIETSUPP-2MOS | 0.9677 | 0.6818 | 0.9762 | 0.8000 | 0.7455 | 0.9762 | 0.8454 | 0.8227 | 0.8290 |
|
| |||||||||
| DRUG-ABUSE | 1 | 0.3333 | 1 | 0.5000 | 0.9765 | 1 | 0.9881 | 0.7440 | 0.6667 |
|
| |||||||||
| ENGLISH | 0.9730 | 0.9863 | 0.8462 | 0.9796 | 0.9167 | 0.8462 | 0.8800 | 0.9298 | 0.9162 |
|
| |||||||||
| HBA1C | 0.8077 | 0.6000 | 0.9020 | 0.6885 | 0.7667 | 0.9020 | 0.8288 | 0.7587 | 0.7510 |
|
| |||||||||
| KETO-1YR | 0 | 0 | 0.9884 | 0 | 1 | 0.9884 | 0.9942 | 0.4971 | 0.4942 |
|
| |||||||||
| MAJOR-DIABETES | 0.9412 | 0.7442 | 0.9535 | 0.8312 | 0.7885 | 0.9535 | 0.8632 | 0.8472 | 0.8488 |
|
| |||||||||
| MAKES-DECISIONS | 0.9762 | 0.9880 | 0.3333 | 0.9820 | 0.5000 | 0.3333 | 0.4000 | 0.6910 | 0.6606 |
|
| |||||||||
| MI-6MOS | 0.3636 | 0.5000 | 0.9103 | 0.4211 | 0.9467 | 0.9103 | 0.9281 | 0.6746 | 0.7051 |
|
| |||||||||
| Overall (micro) | 0.9068 | 0.8693 | 0.9378 | 0.8877 | 0.9115 | 0.9378 | 0.9245 | 0.9061 | 0.9035 |
|
| |||||||||
| Overall (macro) | 0.8168 | 0.7283 | 0.8615 | 0.7542 | 0.8649 | 0.8615 | 0.8577 | 0.8060 | 0.7949 |
SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms, AUC: area under the receiver operating characteristic curve.
See Table 1 for abbreviations for criteria.
Table 4 compares our model’s scores to the best results from other published machine learning (ML) methods on the same n2c2 dataset, both on average and for each criterion. Encouragingly, our model achieved the second-highest micro F-measure and the highest macro F-measure among all compared methods. Moreover, it achieved the top F-score for five of 13 criteria: ABDOMINAL, CREATININE, ASP-FOR-MI, ENGLISH, and ALCOHOL-ABUSE.
Table 4.
Comparison of the results of our proposed method with the best results of other ML-based experiments published with this dataset
| Proposed method | Hassanzadeh et al. (Enriched multi-layer perception) | Segura-Bedmar et al. (Hybrid CNN+RNN approach) | Xiong, Peng et al. (NCBI-BERT + attention) | Chen et al. (Medical knowledge-infused CNN) | Spasic et al. | Xiong, Shi et al. (LSTM-highway-LSTM) | |
|---|---|---|---|---|---|---|---|
| ABDOMINAL | 0.9232 | 0.7226 | 0.4792 | 0.8530 | 0.7680 | 0.7677 | 0.7811 |
| ADVANCED-CAD | 0.8114 | 0.7654 | 0.3478 | 0.8947 | 0.6840 | 0.8814 | 0.8103 |
| ALCOHOL-ABUSE | 1 | 0.4911 | 0.4915 | 0.4911 | 0.7440 | 0.6417 | 0.4911 |
| ASP-FOR-MI | 0.8229 | 0.4416 | 0.4340 | 0.7735 | 0.7430 | 0.7442 | 0.7734 |
| CREATININE | 0.9549 | 0.7480 | 0.5581 | 0.7955 | 0.7180 | 0.8716 | 0.8380 |
| DIETSUPP-2MOS | 0.8227 | 0.7185 | 0.6703 | 0.8586 | 0.7560 | 0.8350 | 0.8371 |
| DRUG-ABUSE | 0.7440 | 0.4911 | 0.4828 | 0.7441 | 0.7560 | 0.7378 | 0.6910 |
| ENGLISH | 0.9298 | 0.4591 | 0.4737 | 0.8303 | 0.7920 | 0.7929 | 0.8745 |
| HBA1C | 0.7587 | 0.6098 | 0.4000 | 0.8259 | 0.7730 | 0.9253 | 0.9253 |
| KETO-1YR | 0.4971 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 |
| MAJOR-DIABETES | 0.8472 | 0.8372 | 0.4665 | 0.8954 | 0.8140 | 0.8254 | 0.8602 |
| MAKES-DECISIONS | 0.6910 | 0.4911 | 0.4828 | 0.7440 | 0.4910 | 0.6910 | 0.7440 |
| MI-6MOS | 0.6746 | 0.4756 | 0.4828 | 0.4756 | 0.4760 | 0.6605 | 0.7933 |
| Overall micro F score | 0.9061 | 0.8399 | 0.7856 | 0.9070 | 0.8610 | 0.8904 | 0.9021 |
| Overall macro F score | 0.8060 | 0.5962 | 0.4823 | - | 0.6730 | - | - |
ML: machine learning, CNN: convolutional neural network, RNN: recurrent neural network, NCBI: National Center for Biotechnology Information, BERT: bidirectional encoder representations from transformer, LSTM: long short-term memory.
See Table 1 for abbreviations for criteria.
Hassanzadeh et al. Matching patients to clinical trials using semantically enriched document representation. J Biomed Inform 2020;105:103406
Segura-Bedmar et al. Cohort selection for clinical trials using deep learning models. J Am Med Inform Assoc 2019;26(11):1181–8.
Xiong, Peng et al. A unified machine reading comprehension framework for cohort selection. IEEE J Biomed Health Inform 2021;26(1):379–87.
Chen et al. Medical knowledge infused convolutional neural networks for cohort selection in clinical trials. J Am Med Inform Assoc 2019;26(11):1227–36.
Spasic et al. A text mining approach to cohort selection from longitudinal patient records. https://preprints.jmir.org/preprint/15980/submitted.
Xiong, Shi et al. Cohort selection for clinical trials using hierarchical neural network. J Am Med Inform Assoc 2019;26(11):1203–8.
We also conducted limited experiments with GPT-4. Unlike GPT-3.5, GPT-4 does not impose strict input length limitations, allowing us to provide unsummarized clinical freetext with prompts. However, we did not observe consistent performance improvements compared to GPT-3.5-turbo. These findings underscore the importance of structured summarization, which enhances the effective use of the model’s capacity. The detailed results for GPT-4 are provided in Supplement A.
IV. Discussion
The application of a prompt-based learning model in this study yielded promising results. To further illustrate the process, we summarized the results for one representative criterion, “ALCOHOL-ABUSE,” in Supplementary Materials. Supplement B provides the list of SNOMED CT terms identified for “ALCOHOL-ABUSE.” Supplement C presents, for each test record, the selected sentences containing one of these terms, the ground truth label, and the model’s prediction for comparison.
Our model achieved overall micro and macro F-scores of 0.9061 and 0.8060, respectively, demonstrating acceptable performance across all criteria. To evaluate performance by criterion, we categorized the criteria based on the NLP methods required, as described by Stubbs et al. [27]:
(1) Concept extraction: Four criteria—ABDOMINAL, MAJOR-DIABETES, CREATININE, and HBA1C—primarily required the extraction of clinical terms. We addressed this by mapping each criterion to relevant SNOMED CT concepts. While standardized ontologies improve consistency, reproducibility, and completeness, challenges remain. Many records in the challenge dataset used rare medical terms, general pathology terms rather than precise category names, or uncommon abbreviations.
(2) Temporal reasoning: Another four criteria—DIET-SUPP-2MOS, MI-6MOS, ADVANCED-CAD, and KETO- 1YR—necessitated temporal processing. Since the data consisted of medical records from different time points, these criteria required time-aware analysis. Relevant sentences and their associated timestamps were extracted for these criteria.
For DIETSUPP-2MOS, a key advantage was leveraging the SNOMED CT ontology to identify supplement names, instead of relying on manually curated dictionaries. The main limitation, however, was incomplete coverage of commercial names and abbreviations.
For MI-6MOS, although our system achieved the second-highest F-score among ML methods, the main limitation was false negatives due to gaps in the SNOMED CT terms used for sentence selection. To address this, we used more general concepts such as “ischemic heart disease” or “disorders of coronary artery” rather than only “myocardial infarction,” but some records still lacked direct references to infarction.
For ADVANCED-CAD, results showed accurate selection of relevant sentences and identification of required time points. However, the exact definitions for required signs and symptoms were not always specified. For example, some antihypertensive drugs may be prescribed to patients with ischemic heart disease absent definitive hypertension, and it is unclear whether such cases meet the criterion.
(3) Inference: Our model outperformed other ML methods for ASP-FOR-MI and ENGLISH, but MAKES-DECISIONS was more challenging. Error analysis revealed that most misclassifications were due to difficulties in concept extraction rather than logical inference. MAKES-DECISIONS encompasses a broad clinical pathology with diverse signs and symptoms, making extraction of all relevant SNOMED CT concepts challenging.
For DRUG-ABUSE and ALCOHOL-ABUSE, our model performed well, with only a few false negatives due to missing rare drug names in the SNOMED CT concept list.
Compared to rule-based systems, ML models offer greater efficiency and adaptability by learning directly from data. Their scalability is evident in their ability to utilize larger, more diverse datasets without the need for rule modifications. Thus, ML techniques are well-suited for handling increasingly complex clinical data [28].
Moreover, the strengths of GPT models for cohort selection become even more apparent with larger datasets. First, in clinical practice, generating labeled training data is often cumbersome; GPT models are advantageous because they do not require task-specific training. Second, GPT models are ideal when a language model with a broad knowledge base is needed, as they are extensively pre-trained on a variety of text sources. Third, pre-processing is typically one of the most time-consuming and expertise-dependent steps in NLP pipelines. Achieving strong results with GPT models without extensive pre-processing can expedite the clinical matching process and reduce reliance on domain experts [29,30].
Additionally, we introduced an extractive summarization method using the SNOMED CT ontology and an annotation tool to capture essential information from source texts. This approach not only addresses the input length limitations of LLMs but also enhances prompt-based model performance. Supplement D provides an example: it displays an unsummarized free-text record alongside its summarized version, the prompt for one eligibility criterion (major abdominal surgery), the true label, the GPT-3.5 prediction using summarized data, and the GPT-4 prediction without summarization. Notably, GPT-4 misclassified the record, while GPT-3.5 with summarization produced the correct label—highlighting the value of summarization for removing irrelevant content and improving classification accuracy.
While newer models such as GPT-4—with larger context windows and improved reasoning—offer promising alternatives, they come with higher computational costs and infrastructure requirements. Our study demonstrates that an ontology-based summarization pipeline can significantly enhance LLM performance, even for models with limited input capacity.
In conclusion, our automated clinical trial matching solution streamlines what is typically a manual, time-consuming recruitment process by leveraging advanced technologies. A key direction for future work is the integration of structured summarization with chain-of-thought (CoT) prompting to further improve clinical eligibility classification. CoT techniques facilitate the decomposition of complex instructions into intermediate reasoning steps, which may enhance accuracy, particularly for tasks involving inference, temporality, or implicit logic.
Footnotes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Supplementary Materials
Supplementary materials can be found via https://doi.org/10.4258/hir.2025.31.4.367.
References
- 1.Nathan RA. How important is patient recruitment in performing clinical trials? J Asthma. 1999;36(3):213–6. doi: 10.3109/02770909909075405. [DOI] [PubMed] [Google Scholar]
- 2.Schreiweis B, Trinczek B, Kopcke F, Leusch T, Majeed RW, Wenk J, et al. Comparison of electronic health record system functionalities to support the patient recruitment process in clinical trials. Int J Med Inform. 2014;83(11):860–8. doi: 10.1016/j.ijmedinf.2014.08.005. [DOI] [PubMed] [Google Scholar]
- 3.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008. [Google Scholar]
- 4.Qiu X, Sun T, Xu Y, Shao Y, Dai N, Huang X. Pretrained models for natural language processing: a survey. Sci China Technol Sci. 2020;63(10):1872–97. doi: 10.1007/s11431-020-1647-3. [DOI] [Google Scholar]
- 5.Devlin J, Chang MW, Lee K, Toutanova K. BERT: pretraining of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019 Jun 2–7; Minneapolis, MN, USA. pp. 4171–86. [DOI] [Google Scholar]
- 6.Lehman E, Johnson A. Clinical-T5: large language models built using mimic clinical text. PhysioNet. 2023;101:215–20. doi: 10.13026/rj8x-v335. [DOI] [Google Scholar]
- 7.Ma C, Wu Z, Wang J, Xu S, Wei Y, Zeng F, et al. ImpressionGPT: an iterative optimizing framework for radiology report summarization with ChatGPT [Internet] Ithaca (NY): arXiv.org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2304.08448v1. [Google Scholar]
- 8.Daungsupawong H, Wiwanitkit V. Performance of Chat-GPT as an AI-assisted decision support tool in medicine: comment. Acta Cardiol. 2024;79(3):413. doi: 10.1080/00015385.2024.2322185. [DOI] [PubMed] [Google Scholar]
- 9.Rao A, Kim J, Kamineni M, Pang M, Lie W, Dreyer KJ, et al. Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J Am Coll Radiol. 2023;20(10):990–7. doi: 10.1016/j.jacr.2023.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Liu Z, Li Y, Shu P, Zhong A, Jiang H, Pan Y, et al. Radiology-GPT: a large language model for radiology [Internet] Ithaca (NY): arXiv.org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2306.08666v1. [Google Scholar]
- 11.Liu Y, Han T, Ma S, Zhang J, Yang Y, Tian J, et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-radiology. 2023;1(2):100017. doi: 10.1016/j.metrad.2023.100017. [DOI] [Google Scholar]
- 12.Wang J, Shi E, Yu S, Wu Z, Ma C, Dai H, et al. Prompt engineering for healthcare: Methodologies and applications [Internet] Ithaca (NY): arXiv.org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2304.14670v1. [Google Scholar]
- 13.Zhang K, Yu J, Adhikarla E, Zhou R, Yan Z, Liu Y, et al. BiomedGPT: a unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks [Internet] Ithaca (NY): arXiv. org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2305.17100v1. [Google Scholar]
- 14.Liu Z, Wu Z, Hu M, Zhao B, Zhao L, Zhang T, et al. PharmacyGPT: the AI pharmacist. Ithaca (NY): arXiv. org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2307.10432v1. [Google Scholar]
- 15.Zhong T, Pan Y, Zhang Y, Wei Y, Yang L, Liu Z, et al. ChatABL: abductive learning via natural language interaction with ChatGPT. IEEE Trans Neural Netw Learn Syst. 2025;36(10):17635–17649. doi: 10.1109/TNNLS.2025.3567945. [DOI] [PubMed] [Google Scholar]
- 16.Dai H, Liu Z, Liao W, Huang X, Cao Y, Wu Z, et al. Aug-GPT: leveraging chatGPT for text data augmentation [Internet] Ithaca (NY): arXiv.org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2302.13007. [Google Scholar]
- 17.Dai H, Li Y, Liu Z, Zhao L, Wu Z, Song S, et al. AD-AutoGPT: an autonomous GPT for Alzheimer’s disease infodemiology [Internet] Ithaca (NY): arXiv.org; 2023. [cited at 2025 Sep 30]. Available from: https://arxiv.org/abs/2306.10095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ghim JL, Ahn S. Transforming clinical trials: the emerging roles of large language models. Transl Clin Pharmacol. 2023;31(3):131–8. doi: 10.12793/tcp.2023.31.e16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Chang E, Mostafa J. Cohort identification from free-text clinical notes using SNOMED CT’s hierarchical semantic relations. AMIA Annu Symp Proc. 2023;2022:349–58. [PMC free article] [PubMed] [Google Scholar]
- 20.Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267–70. doi: 10.1093/nar/gkh061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Lamy JB, Venot A, Duclos C. PyMedTermino: an open-source generic API for advanced terminology services. Stud Health Technol Inform. 2015;210:924–8. [PubMed] [Google Scholar]
- 22.Jain D, Borah MD, Biswas A. Summarization of legal documents: where are we now and the way forward. Comput Sci Rev. 2021;40:100388. doi: 10.1016/j.cosrev.2021.100388. [DOI] [Google Scholar]
- 23.Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, et al. Multi-domain clinical natural language processing with MedCAT: the medical concept annotation toolkit. Artif Intell Med. 2021;117:102083. doi: 10.1016/j.artmed.2021.102083. [DOI] [PubMed] [Google Scholar]
- 24.Heston TF, Khun C. Prompt engineering in medical education. Int Med Educ. 2023;2(3):198–205. doi: 10.3390/ime2030019. [DOI] [Google Scholar]
- 25.Riyanto S, Imas SS, Djatna T, Atikah TD. Comparative analysis using various performance metrics in imbalanced data for multi-class text classification. Int J Adv Comput Sci Appl. 2023;14(6):1082–90. doi: 10.14569/IJACSA.2023.01406116. [DOI] [Google Scholar]
- 26.OpenAI . GPT-3.5 Turbo API [Internet] San Francisco (CA): OpenAI; 2025. [cited at 2025 Sep 30]. Available from: https://platform.openai.com/docs/models/gpt-3.5-turbo. [Google Scholar]
- 27.Stubbs A, Filannino M, Soysal E, Henry S, Uzuner O. Cohort selection for clinical trials: n2c2 2018 shared task track 1. J Am Med Inform Assoc. 2019;26(11):1163–71. doi: 10.1093/jamia/ocz163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Rijcken E, Zervanou K, Mosteiro P, Scheepers F, Spruit M, Kaymak U. Machine learning vs. rule-based methods for document classification of electronic health records within mental health care: a systematic literature review. Nat Lang Process J. 2025;10:100129. doi: 10.1016/j.nlp.2025.100129. [DOI] [Google Scholar]
- 29.Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. doi: 10.3389/frai.2023.1169595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Thirunavukarasu AJ, Ting DS, Elangovan K, Gutierrez L, Tan TF, Ting DS. Large language models in medicine. Nat Med. 2023;29(8):1930–40. doi: 10.1038/s41591-023-02448-8. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.


