Abstract
Background
Core Outcome Sets (COS) are essential for standardizing outcome reporting in clinical research, yet their development remains resource-intensive and time-consuming. Traditional COS development requires months of expert work for manual outcome extraction and classification from literature. While machine learning (ML) has shown promise in automating systematic reviews, its application to COS development, particularly for outcome identification and classification, remains underexplored. This study evaluates whether ML models can accurately extract and classify verbatim outcomes from clinical studies according to the COMET taxonomy and determines the amount of manually annotated data needed to support reliable model performance.
Methods
We developed an ML pipeline using a dataset of 149 full-text studies on lower limb lengthening surgery. The pipeline comprised a Sentence-BERT-based extraction model for identifying verbatim outcomes and a classification model for assigning outcomes to COMET taxonomy domains. We systematically assessed performance using training sets ranging from 5 to 85 articles to establish a practical threshold for reliable model behavior. Model performance was validated using a 28-article hold-out set with standard metrics: precision, recall, and F1-score.
Results
A training size of 20 articles proved sufficient for stable model performance. The extraction model achieved an F1-score of 94% with precision and recall above 90%. The classification model attained a weighted-average F1-score of 86%, with 87% precision and 88% recall. When applied to the full dataset, the system successfully identified 94% of manually extracted outcomes. The distribution of outcome domains identified by ML closely mirrored manual classification with high accuracy.
Conclusion
This study demonstrates the feasibility of applying ML-based outcome extraction and classification within a specific COS development context for lower limb lengthening surgery. By reducing annotation requirements from 149 to just 20 articles while maintaining high accuracy, our approach offers a scalable, reproducible solution that substantially reduces the manual workload in COS development. This pipeline can play a significant role in streamlining evidence synthesis processes, potentially accelerating the generation of outcome lists for consensus-building exercises in COS development.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13018-025-06386-8.
Keywords: Core outcome sets, Machine learning, Outcome extraction, Outcome classification, Lower limb lengthening surgery, Transfer learning
Introduction
Core Outcome Sets (COS) are standardized sets of outcomes agreed upon by stakeholders for use in clinical studies and systematic reviews [1]. By reducing heterogeneity in outcome reporting, COS enhance the comparability and synthesis of evidence across studies, ultimately improving the applicability of research findings to clinical practice [2, 3]. The development of COS typically begins with a scoping or systematic review to identify potential outcomes, followed by stakeholder consensus to prioritize and finalize the set [4, 5]. Despite their growing adoption and development across specialties, the traditional process of COS development remains highly labor-intensive, time-consuming, and prone to inconsistencies [6]. Consequently, COS development often requires several months of expert attention and remains susceptible to human error [7, 8]. Additionally, the repeated identification of similar outcomes after reaching saturation creates inefficiency and waste of resources [9].
Recent advancements in artificial intelligence (AI), particularly machine learning (ML), offer promising solutions to these challenges [10]. ML has demonstrated success in automating data extraction for systematic reviews, reducing review times while maintaining accuracy [7, 11]. For instance, large language models (LLMs) have shown utility in systematic reviews by significantly decreasing the time required for data extraction [11, 12]. These technologies excel at processing large datasets quickly and accurately, presenting opportunities to streamline key components of COS development. Automating outcome reporting in COS could reduce workload, enhance data availability, and improve research quality.
We hypothesize that ML algorithms can accurately extract and classify verbatim outcomes from research papers into a COS framework [13], with minimal manual input.
This study will first investigate the relationship between the amount of manual input, quantified as the number of manually annotated training articles, and the resulting accuracy of the ML algorithms. Based on this analysis, we aim to propose a practical cut-off for the amount of manual input required to achieve acceptable performance.
The overall accuracy of the algorithms in extracting and classifying verbatim outcomes related to lower limb lengthening surgery (LLLS) will then be evaluated using a model trained on the number of articles identified by this cut-off, with performance measured by precision, recall, and F1-score compared to the manual benchmark [14].
Methods
Selection of reference standard
Our training and reference data were derived from a prior scoping review of outcomes in LLLS, conducted to inform the development of a COS [14]. The review included 149 studies. Outcomes from these studies were first manually extracted word-for-word (verbatim outcomes) by one investigator and subsequently verified for accuracy by a second investigator. Each extracted outcome was then manually categorized into predefined high-level outcome categories (outcome domains) based on the COMET taxonomy, a recognized standard for outcome classification. These outcome domains are grouped into four main areas: Mortality, Physiological/Clinical, Life Impact, and Adverse Events (the area of Resource Use was excluded from this study) [13].
This step aimed to match each identified outcome to the domain category that fits best. The COMET taxonomy includes 34 outcome domains across these four areas, covering categories such as musculoskeletal and connective tissue outcomes, physical functioning, and global quality of life. This dataset was used as the training and reference source for developing and evaluating the ML models.
For example, a verbatim outcome such as “Bone Healing Index” would be manually extracted and then classified within the outcome domain “Musculoskeletal and Connective Tissue,” under the core area “Physiological/Clinical outcomes” according to the COMET taxonomy of outcomes [13].
Machine learning model development
Dataset preparation for machine learning
For ML training, we selected studies with clearly marked ‘Results’ and ‘Discussion’ sections to ensure consistent outcome extraction. Of the original 149 studies focused on LLLS in human subjects, we excluded 35 that did not clearly separate these sections, such as case reports. This left a final set of 114 studies containing 2,941 outcomes for training and validation. To teach the model to distinguish relevant outcomes from unrelated text (negative examples), we randomly extracted non-outcome phrases from the same articles. Non-outcome phrases are defined here as noun phrases that appear outside the ‘Results’ and ‘Discussion’ sections and differ semantically from the verbatim outcomes.
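The sampling of negative examples can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_balanced_examples`, the label encoding (1 = outcome, 0 = non-outcome), the exact-match filter standing in for "semantically differ", and the example phrases are all assumptions made for the sketch. The equal class sizes follow the balancing described later in the Methods.

```python
import random

def build_balanced_examples(outcomes, candidate_non_outcomes, seed=0):
    """Pair the outcome phrases with an equal number of random non-outcome phrases.

    Returns a shuffled list of (phrase, label) tuples, where label 1 marks an
    outcome and label 0 a non-outcome (an assumed encoding).
    """
    outcome_set = {p.lower() for p in outcomes}
    # Drop candidates that duplicate a known outcome. This exact-match check is
    # only a simple stand-in for a semantic comparison.
    pool = sorted({p for p in candidate_non_outcomes if p.lower() not in outcome_set})
    if len(pool) < len(outcomes):
        raise ValueError("not enough non-outcome phrases to balance the classes")
    rng = random.Random(seed)
    negatives = rng.sample(pool, k=len(outcomes))
    examples = [(p, 1) for p in outcomes] + [(p, 0) for p in negatives]
    rng.shuffle(examples)
    return examples

# Hypothetical phrases, for illustration only.
outcomes = ["bone healing index", "pin site infection"]
candidates = [
    "the present study",
    "ethical approval",
    "Bone Healing Index",      # filtered out: matches a known outcome
    "a retrospective design",
]
examples = build_balanced_examples(outcomes, candidates)
```

Fixing the seed keeps the sampled negatives reproducible between runs, which matters when models trained on different article counts are compared.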
Model development and evaluation
We applied an ML system to process full-text PDF articles by extracting verbatim outcomes and assigning them to predefined COMET domains. This system incorporates two distinct ML models: one for extracting verbatim outcomes and another for classifying them into the appropriate outcome domains.
As both models classify single texts, noun phrases were selected as the textual units for classification, as verbatim outcomes often take the form of noun phrases. These noun phrases were extracted from the PDF text using spaCy [15, 16].
Outcome extraction
For outcome extraction, we used the extracted outcome and non-outcome phrases to train a binary text classifier that distinguishes between the two classes, using the linguistic features of the phrases as predictors of whether a phrase constitutes an outcome. Specifically, we used the SetFit framework [17] to fine-tune a pre-trained Sentence-BERT (SBERT) model (gte-base) [18]. The gte-base model was selected because it offered a good trade-off between benchmark performance and model size, making it computationally efficient while maintaining strong representational capacity [19]. Fine-tuning was performed via contrastive learning to adapt the embeddings to the task of identifying verbatim outcomes, using 2 epochs, a batch size of 64, and a learning rate of 1.5e-5. After fine-tuning, the adjusted SBERT model generated task-specific embeddings, which served as input features for a logistic regression classifier. The classifier was trained for 200 iterations with L2 regularization and the liblinear solver, with all hyperparameters derived through systematic hyperparameter optimization, to predict whether a given noun phrase represented a verbatim outcome.
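The contrastive step can be made concrete with a small sketch of how labeled phrases are turned into training pairs: phrases with the same label form positive pairs, and phrases with different labels form negative pairs. The code below is a simplified illustration of that idea only. SetFit's own pair sampler and loss differ in detail, and the function name `make_contrastive_pairs` and the example phrases are assumptions.

```python
import itertools
import random

def make_contrastive_pairs(examples, max_pairs=None, seed=0):
    """Build (text_a, text_b, target) triples from (phrase, label) examples.

    target is 1.0 when both phrases share a label and 0.0 otherwise, so a
    similarity-based loss pulls same-class embeddings together and pushes
    different-class embeddings apart.
    """
    pairs = [
        (a_text, b_text, 1.0 if a_label == b_label else 0.0)
        for (a_text, a_label), (b_text, b_label) in itertools.combinations(examples, 2)
    ]
    # Optionally subsample, since the number of pairs grows quadratically.
    if max_pairs is not None and len(pairs) > max_pairs:
        pairs = random.Random(seed).sample(pairs, max_pairs)
    return pairs

# Hypothetical labeled phrases (1 = outcome, 0 = non-outcome).
labeled = [
    ("bone healing index", 1),
    ("pin site infection", 1),
    ("the present study", 0),
    ("ethical approval", 0),
]
pairs = make_contrastive_pairs(labeled)
```

Because even a handful of phrases yields many pairs, this kind of pairing is one reason few-shot fine-tuning can work with a small number of annotated articles.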
To evaluate whether the outcome extraction model could be effectively trained on a small number of studies to support automated outcome extraction across a larger body of literature, we systematically trained and assessed the model using varying sample sizes ranging from 5 to 85 articles, with a 5-article increment. At each increment, the phrases from the articles were used both as fine-tuning data for adjusting the embeddings and as the supervised training set for the classifier built on the adjusted embeddings. Each sample of articles was split 80-20, with 80% used for training and 20% for evaluation to monitor model performance during training. Each sample contained an equal number of outcome and non-outcome phrases, and all models were tested against the same hold-out set of 28 articles. The discrete predictions of the model trained on each training set were then assessed using standard ML metrics: accuracy (the proportion of correct predictions), precision (the proportion of correctly identified outcomes among all predicted outcomes), recall (the proportion of actual outcomes correctly identified), and F1-score (the harmonic mean of precision and recall).
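The metric definitions above translate directly into code. F1-score, which is reported in the Results, is the harmonic mean of precision and recall. The sketch below is illustrative: the function name `binary_metrics` and the toy label vectors are assumptions, with 1 denoting an outcome phrase and 0 a non-outcome phrase.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = outcome)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 4 true outcomes and 4 true non-outcomes, one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```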
Next, we defined a sufficient training size as the smallest number of articles for which all evaluation metrics consistently exceeded 90%. A stratified five-fold cross-validation was then performed to assess the robustness of the model trained on this selected dataset size and to ensure the results were not dependent on a particular data split. After selecting the training size, the outcome extraction model was tested using the 28-article hold-out set, set aside from the initial 114 studies, to evaluate its performance on unseen data.
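The cut-off rule can be expressed as a short function. The sketch assumes one reading of "consistently exceeded": every metric must be above the threshold at the chosen size and at every larger size evaluated. The function name `sufficient_training_size` is an assumption, and all metric values in the example are hypothetical placeholders except the row for 20 articles, which uses the precision, recall, and F1-score reported in the Results.

```python
def sufficient_training_size(results, threshold=0.90):
    """Return the smallest training size from which all metrics stay above threshold.

    results maps a training size (number of articles) to a dict of metric values.
    Returns None if even the largest size fails the threshold.
    """
    chosen = None
    # Walk from the largest size downward and stop at the first failure, so the
    # chosen size and every larger size all pass the threshold.
    for size in sorted(results, reverse=True):
        if all(value > threshold for value in results[size].values()):
            chosen = size
        else:
            break
    return chosen

# Hypothetical values, except the n = 20 row (taken from the reported results).
results = {
    5: {"precision": 0.80, "recall": 0.85, "f1": 0.82},
    10: {"precision": 0.88, "recall": 0.91, "f1": 0.89},
    15: {"precision": 0.91, "recall": 0.89, "f1": 0.90},
    20: {"precision": 0.95, "recall": 0.94, "f1": 0.94},
    25: {"precision": 0.95, "recall": 0.95, "f1": 0.95},
}
cutoff = sufficient_training_size(results)
```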
All models and analyses were prepared using Python and R scripts, and the code is available on GitHub.
Outcome domain classification
The outcome classification model was trained on the same articles as the outcome extraction model, ensuring consistency across both tasks and enabling evaluation of successful classification with a minimal number of training articles. The model used the same architecture as the outcome extraction model, leveraging the same pre-trained SBERT model [17] and identical training parameters, differing only in the classification task: binary classification for outcome extraction and multi-class classification based on the COMET taxonomy for outcome classification. The training data comprised manually extracted verbatim outcomes and their corresponding COMET domain assignments.
We tested the model against the 28-article hold-out set and assessed the model’s performance using weighted averages of standard ML metrics (Fig. 1).
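A weighted average weights each domain's score by its support, that is, the number of true examples in that domain, so that large domains contribute more than rare ones. The sketch below shows this for F1-score. The function name `weighted_f1` and the abbreviated toy labels are assumptions, and the calculation follows the usual support-weighted definition rather than the authors' exact code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        # Weight each class by its share of the true labels.
        score += f1 * n_true / total
    return score

# Toy labels: "MSK" = musculoskeletal outcomes, "PF" = physical functioning.
y_true = ["MSK", "MSK", "MSK", "PF"]
y_pred = ["MSK", "MSK", "PF", "PF"]
score = weighted_f1(y_true, y_pred)
```

In the toy example the per-class F1 scores are 0.80 for "MSK" and about 0.67 for "PF". With supports of 3 and 1, the weighted average is about 0.77, closer to the majority class.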
Fig. 1.
System Overview: PDF texts are processed with spaCy [15, 16] to extract textual units, consisting of non-outcome phrases (gray circles) and candidate outcome phrases (colored circles). A fine-tuned, pre-trained SBERT-based model [17] extracts outcomes by filtering out non-outcomes, retaining only the outcome phrases (colored circles). A separate SBERT-based model [17], fine-tuned for outcome classification, assigns these outcomes to domains according to the COMET taxonomy [13]. Created in BioRender. Yalcinkaya, A. (2025) https://BioRender.com/1iffebk
System integration
Once both models were trained, they were integrated into a unified system designed to automate the extraction and classification of outcomes from full-text articles. The system followed a sequential process: (1) parsing the text layer of the PDF, (2) identifying the results section, (3) extracting noun phrases from this section, (4) applying the extraction model to identify verbatim outcomes, and (5) using the classification model to assign each outcome to a predefined COMET domain [13]. The overall performance of the system was then evaluated by comparing its automatically extracted and categorized outcomes with those manually identified in a prior scoping review using the same dataset [14].
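The sequential process can be sketched as a function that chains the steps, with each model passed in as a callable. Everything below is a toy stand-in: `run_pipeline` and the helper functions are hypothetical names, the comma-based phrase splitter replaces spaCy noun-phrase extraction, and the dictionary lookup replaces both fine-tuned SBERT models. Step 1 (parsing the PDF text layer) is assumed to have produced `pdf_text` already, and the domain assigned to "pin site infection" is an assumption for illustration.

```python
from typing import Callable, Dict, List

def run_pipeline(
    pdf_text: str,
    find_results_section: Callable[[str], str],
    extract_noun_phrases: Callable[[str], List[str]],
    is_outcome: Callable[[str], bool],
    classify_domain: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Chain steps 2 to 5: section lookup, phrase extraction, filtering, classification."""
    results_text = find_results_section(pdf_text)
    phrases = extract_noun_phrases(results_text)
    outcomes = [p for p in phrases if is_outcome(p)]
    return [{"outcome": p, "domain": classify_domain(p)} for p in outcomes]

# Toy stand-ins for the real components.
def toy_results_section(text: str) -> str:
    start = text.find("Results")
    if start == -1:
        return ""
    end = text.find("Discussion", start)
    return text[start + len("Results"): end if end != -1 else len(text)]

def toy_noun_phrases(text: str) -> List[str]:
    return [p.strip().lower() for p in text.replace(".", ",").split(",") if p.strip()]

TOY_DOMAINS = {
    "bone healing index": "Musculoskeletal and connective tissue outcomes",
    "pin site infection": "Infection and infestation outcomes",  # assumed mapping
}

pdf_text = (
    "Methods We reviewed charts. "
    "Results Bone healing index, pin site infection, the cohort. "
    "Discussion Limitations apply."
)
records = run_pipeline(
    pdf_text,
    toy_results_section,
    toy_noun_phrases,
    is_outcome=lambda p: p in TOY_DOMAINS,
    classify_domain=lambda p: TOY_DOMAINS[p],
)
```

Passing the components as callables keeps the extraction and classification models independently replaceable, which mirrors the two-model design described above.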
Results
Determining a cut-off for manually annotated data in outcome extraction
The outcome extraction model’s performance was evaluated by progressively increasing the number of training articles with a 5-article increment. Fig. 2 (Supplementary table 1) shows the evolution of model performance based on the number of articles used for training. The best empirical accuracy, reflected by an F1-score of 98%, was observed at 75 training articles. However, the model already showed stable and high performance at n = 20, with an F1-score of 94%, precision of 95%, and recall of 94% (Fig. 2, supplementary table 1). Given that additional training articles yielded minimal improvement beyond this point, we selected 20 articles as the sufficient dataset size. When validated through stratified five-fold cross-validation, this model consistently achieved high average values for precision (95%), recall (97%), and F1-score (96%), and discriminated clearly between positives and negatives, with an average AUROC of 98% and average AUPRC of 97%.
Fig. 2.
Model Performance Across Varying Training Sizes. For detailed performance metrics, see Supplementary Table 1. Created in Flourish. https://flourish.studio [20]
When tested on the hold-out validation dataset, the final outcome extraction model achieved an accuracy of 94%, with a precision of 93%, recall of 95%, and an F1-score of 94% for identifying outcome text. For non-outcome text, the model achieved a precision of 95% and recall of 93%. Overall, the model was able to clearly distinguish between the two classes, achieving an AUROC score of 97% and an AUPRC score of 96%.
When applied to the full dataset of 114 articles, the model successfully identified 94% of all manually extracted verbatim outcomes.
Outcome domain classification
The outcome domain classification model aimed to assign each identified outcome to its respective COMET taxonomy domain. The model was trained on the same 20-article dataset used for outcome extraction training and then integrated into the pipeline for a comprehensive evaluation. Overall, the model achieved a weighted-average F1-score of 86%, precision of 87%, recall of 88%, and AUROC of 95%.
Performance varied slightly across individual outcome domains (Fig. 3, supplementary table 2). High classification accuracy was observed across several domains, including Personal circumstances, Global quality of life, and Nervous system outcomes, each achieving 100% accuracy. Musculoskeletal and connective tissue outcomes, the largest category, also demonstrated robust accuracy at 90% (2159 correctly classified out of 2394). Infection and infestation outcomes (95%, 141 out of 148) and Physical functioning (95%, 95 out of 100) similarly showed high accuracy. Other domains exhibited somewhat lower accuracy, such as Adverse events (78%, 184 out of 237) and Delivery of care (88%, 15 out of 17).
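The per-domain percentages follow directly from the counts quoted above, as the short check below shows. The counts are taken from the text. The variable names and the rounding to whole percentages are choices made for this sketch.

```python
# (correctly classified, total) per domain, as reported in the text.
domain_counts = {
    "Musculoskeletal and connective tissue outcomes": (2159, 2394),
    "Infection and infestation outcomes": (141, 148),
    "Physical functioning": (95, 100),
    "Adverse events": (184, 237),
    "Delivery of care": (15, 17),
}

# Per-domain accuracy, rounded to whole percentages.
accuracy_pct = {
    domain: round(100 * correct / total)
    for domain, (correct, total) in domain_counts.items()
}

for domain, pct in accuracy_pct.items():
    print(f"{domain}: {pct}%")
```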
Fig. 3.
Comparison of outcome domain frequency and distribution between manual and machine learning (ML) classification methods. For detailed frequency counts and classification results, see Supplementary Table 2. Created in Flourish. https://flourish.studio [20]
In assessing the success of our model, we specifically examined whether the ML approach yielded a similar distribution of outcome domains compared to manual classification, as accurately identifying both the most prevalent domains and the overall distribution across all domains is essential for the robustness and validity of the COS. The ML classification successfully identified the same top outcome domains as the manual method, although with some minor variations in their ranking. For example, “Emotional functioning” was classified as more prevalent by the ML method compared to “Nervous system outcomes,” and the rankings of “Physical functioning” and “Infection and infestation outcomes” were swapped. Nevertheless, the distributions were similar overall, not only for the most prevalent outcome domains but also across the remaining domains (Fig. 3, supplementary table 2).
Discussion
In this study, we present the first application exploring how ML can substantially reduce the manual effort involved in COS development by integrating ML models with human-in-the-loop input. Unlike previous approaches relying heavily on extensive annotation and large datasets, our primary objective was to evaluate whether a practical ML pipeline could be developed using minimal manual annotation. Specifically, this pipeline enabled researchers to reduce the required annotation workload dramatically from reviewing all 149 articles down to just 20. Despite this significant reduction in manual effort, the model successfully identified 94% of manually extracted verbatim outcomes. In addition, we successfully identified all outcome domains that are likely to constitute the final core outcome set.
Furthermore, the outcome extraction phase consistently achieved precision, recall, and F1-scores above 90% even with limited training data. The outcome domain classification achieved a weighted-average F1-score of 86%, with precision and recall scores of 87% and 88%, respectively, highlighting the robustness of the approach. This is particularly noteworthy given that using such a granular taxonomy (34-item) [13] for classification has been considered computationally demanding in the literature [21, 22]. Additionally, the ML model’s distribution and frequency of identified outcome domains closely mirrored those of manual classification, underscoring its accuracy and robustness. By demonstrating the feasibility of reliable transfer learning approaches with limited manual input, this study provides practical insights into achieving strong model performance while greatly enhancing reproducibility, scalability, and feasibility for systematic outcome extraction and domain classification in COS development.
Compared to the literature, our human-in-the-loop approach offers improved performance while requiring substantially less manual annotation. Prior research on automated outcome extraction and classification has often required large training datasets and produced variable performance levels depending on the task and corpus analyzed. For instance, ExaCT, an automated system for extracting clinical trial characteristics, achieved recall and precision between 88% and 97% for specific trial features but required extensive structured input data [11, 23]. Similarly, BioBERT-based approaches for outcome phrase detection from scientific abstracts reported F1-scores between 60% and 81.5%, and between 77.5% and 82.2% for outcome classification, though achieving these results required substantial annotated training data [24, 25], while studies focused on full-text articles achieved F1-scores ranging from 78% to 100% depending on outcome type and task complexity [26]. Other studies focusing on abstract-level extraction have reported F1-scores ranging from 42% to 88% [27–31]. In contrast, our pipeline achieved an F1-score of at least 94% for outcome extraction with only 20 manually annotated articles, highlighting the potential of transfer learning and fine-tuning approaches for COS applications.
Based on performance metrics, the model demonstrates strong capabilities in both extracting outcomes and classifying them into the appropriate COMET domains [13]. A notable limitation is that certain outcome domains rarely reported in the literature were not represented in the training data, leading to occasional classification gaps due to class imbalance (Table 3). Class imbalance is a well-known challenge in machine learning, as models tend to be biased toward majority classes, which can negatively affect performance on underrepresented categories. Despite this inherent difficulty, our model reproduced the overall distribution of classes reasonably well, indicating that it was able to capture the dominant structure of the data even in the presence of imbalance.
Importantly, this limitation has a negligible effect on the model’s utility for COS development. While the goal of the initial scoping stage is to create a comprehensive list of outcomes, the focus is naturally on the more frequently reported domains that are most likely to be considered for the final COS [32, 33]. Our ML approach showed high concordance with the manual review, and critically, it successfully identified all the domains that are the most probable candidates for inclusion in a COS. Furthermore, the overall distribution of domains identified by the model closely mirrored that of the manual review. Therefore, the model serves as a robust tool for reliably identifying the full spectrum of outcomes most pertinent to this critical step in COS development.
The number of extracted and correctly classified phrases identified by the ML model was higher than the number of manually extracted outcomes. This was primarily due to the model extracting multiple semantically similar phrases as distinct entries. For example, phrases such as “pin site infection” and “pin tract infection” in the same article were extracted separately by the model, whereas they were counted as a single outcome during manual extraction. In addition, when the same phrase appeared repeatedly within a text, such as in different sections, tables, or captions, the model extracted each instance separately, contributing further to the total count.
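The effect of repeated mentions on the total count can be illustrated with a small sketch. It is not part of the pipeline described here: the function name `count_mentions` and the example phrases are assumptions. Normalizing case and whitespace merges exact repeats, but near-synonyms such as "pin site infection" and "pin tract infection" remain separate entries and would still need a reviewer or a semantic similarity step to be merged.

```python
from collections import Counter

def count_mentions(extracted):
    """Return (total mentions, unique phrases, per-phrase counts) after normalization."""
    # Lowercase and collapse whitespace so exact repeats map to the same key.
    normalized = [" ".join(p.lower().split()) for p in extracted]
    counts = Counter(normalized)
    return len(normalized), len(counts), counts

# Hypothetical extractions from one article (body text, a table, a caption).
extracted = [
    "Pin site infection",
    "pin site  infection",
    "pin tract infection",
    "Pin site infection",
]
total, unique, counts = count_mentions(extracted)
```

Here four extracted mentions collapse to two unique phrases, whereas a manual reviewer would record a single outcome.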
The findings of this study have significant practical implications for evidence synthesis and COS development processes. The ability to automate much of the extraction and classification process with minimal human input offers substantial efficiency gains and could accelerate the generation of outcome lists for use in consensus-building exercises such as Delphi surveys. Previous studies have emphasized the critical role of human oversight in automated processes [21, 22], and our results corroborate these findings.
Moreover, the proposed pipeline circumvents several well-described challenges associated with large language models (LLMs), including hallucination risk, model and version drift, privacy and governance constraints (e.g., paywalled data, licensing, and access restrictions), and high computational costs, each of which can complicate deployment and maintenance [34]. In contrast, our fine-tuned SBERT (gte-base) model with a supervised logistic-regression classifier achieves strong performance with modest data while remaining open-source, lightweight, and reproducible. These properties, a small footprint and transparent mapping from embeddings to labels, align with COS workflows and human-in-the-loop oversight, and may mitigate practical barriers to LLM deployment in research contexts [7, 12].
Although not prospectively designed as an efficiency study, we provide a post-hoc estimate for both outcome extraction and classification. Manual processing typically requires ~1.5–2 hours per article, implying ~225–300 hours for the 149-article corpus. By contrast, annotating ~20 articles for training takes ~30–40 hours; thereafter, the remaining 129 full texts were processed in <10 minutes on our HPC node, yielding an estimated ~82–90% reduction in manual hours, thereby supporting scalability to larger corpora and facilitating routine COS updates. These figures are estimates rather than prospectively measured timings; actual times may vary with article length/formatting and reviewer experience. A prospective time-and-motion study would refine these estimates and quantify variability.
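The arithmetic behind these estimates can be reproduced in a few lines. The text does not state how the ~82–90% range was derived. One reading that reproduces it, assumed here, pairs the slowest annotation estimate with the lowest rounded corpus total and the fastest annotation estimate with the highest. Machine time (<10 minutes) is ignored, and the variable names are choices made for this sketch.

```python
def manual_hours(n_articles, hours_per_article):
    """Manual effort in hours for a given number of articles."""
    return n_articles * hours_per_article

full_low = manual_hours(149, 1.5)    # 223.5 h, quoted as ~225 h
full_high = manual_hours(149, 2.0)   # 298.0 h, quoted as ~300 h
train_low = manual_hours(20, 1.5)    # 30.0 h
train_high = manual_hours(20, 2.0)   # 40.0 h

# Assumed pairing, using the rounded corpus totals quoted in the text.
worst_case = 1 - train_high / 225    # about 0.82
best_case = 1 - train_low / 300      # 0.90

# If the per-article time is the same for both tasks, it cancels out and the
# reduction depends only on the article counts.
same_rate = 1 - 20 / 149             # about 0.87
```

Under the same-rate assumption the saving is about 87%, which sits inside the quoted range.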
Limitations
This study has several limitations. First, the ML pipeline and models were developed using data primarily from the musculoskeletal domain. As such, we do not claim that the model’s performance would generalize to other areas of medical literature without retraining or domain-specific adjustments. Second, while outer loop cross-validation can enhance generalizability, we opted not to include it. Given the consistent performance metrics observed across the dataset, the random selection of articles for training and evaluation, and the use of an independent test set, we believe additional outer validation would have contributed limited added value to the robustness of our findings.
Conclusion
The findings of this study underscore the transformative potential of integrating ML methodologies into COS development. By automating the extraction and classification of outcomes, our ML-driven approach significantly reduces the manual effort traditionally required, markedly decreasing researchers’ workload while maintaining high accuracy and consistency. A key finding was the close mirroring of both the distribution and frequency of outcome domains identified by the ML model compared to manual methods, affirming its reliability. Consequently, this streamlined process offers greater efficiency, scalability, and reproducibility, ultimately facilitating more effective and standardized development of COSs.
Acknowledgements
This study was undertaken as part of the RECONS initiative (Reconstructive Orthopaedics Core Outcomes and Standardisation Network), an international effort to standardise outcome reporting and strengthen patient-centred research in reconstructive orthopaedic surgery. For further details, visit https://www.therecons.com. This work was partially supported by DeiC National HPC (g.a. DeiC-AAU-L1-132409). All of the computation done for this project was performed on the UCloud interactive HPC system, which is managed by the eScience Center at the University of Southern Denmark.
Authors’ contributions
AY and KGK conceived the study, designed the methodology, and conducted the data analysis. AY curated the dataset, performed the analyses, created the data visualizations, and drafted the manuscript. KGK developed and implemented the machine learning models, contributed to data analysis and visualization, and co-drafted the manuscript. SG assisted with data visualization, contributed to data analysis, and co-drafted the manuscript. OR, SK, and HCH provided domain expertise, contributed to study design and methodology, supervised the project, and critically revised the manuscript for important intellectual content. All authors reviewed and approved the final version of the manuscript.
Funding
None.
Availability of data and materials
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable (no individual person data).
Competing interests
None.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ali Yalcinkaya and Kristian Gade Kjelmann contributed equally to this work and share joint first authorship.
References
- 1. Boers M, Kirwan JR, Wells G, Beaton D, Gossec L, D’Agostino MA, et al. Developing core outcome measurement sets for clinical trials: OMERACT Filter 2.0. J Clin Epidemiol. 2014;67:745–53.
- 2. Kirkham JJ, Williamson P. Core outcome sets in medical research. BMJ Med. 2022;1(1):e000284.
- 3. Chiarotto A, Ostelo RW, Turk DC, Buchbinder R, Boers M. Core outcome sets for research and clinical practice. Braz J Phys Ther. 2017;21(2):77–84.
- 4. Williamson PR, Altman DG, Bagley H, Barnes KL, Blazeby JM, Brookes ST, et al. The COMET Handbook: version 1.0. Trials. 2017;18(3):1–50.
- 5. Aquilina AL, Claireaux H, Aquilina CO, Tutton E, Fitzpatrick R, Costa ML, et al. What outcomes have been reported on patients following open lower limb fracture, and how have they been measured? Bone Joint Res. 2023;12(2):138–46.
- 6. Craxford S, Marson BA, Ollivere B. Outcome sets in orthopaedics: defining ‘what’ and ‘how’ to measure. Bone & Joint 360. 2023;12(4):6–9.
- 7. Tsafnat G, Glasziou P, Choong MK, Dunn A, Galgani F, Coiera E. Systematic review automation technologies. Syst Rev. 2014;3(1):1–15.
- 8. Schmidt L, Finnerty Mutlu AN, Elmore R, Olorisade BK, Thomas J, Higgins JPT. Data extraction methods for systematic review (semi)automation: update of a living systematic review. F1000Res. 2023;10:401.
- 9. Veen K, Joseph A, Sossi F, Jaber PB, Lansac E, Das-Gupta E, et al. Standardized approach to extract candidate outcomes from literature for a standard outcome set: a case- and simulation study. BMC Med Res Methodol. 2023;23(1):1–8.
- 10. Adamson B, Waskom M, Blarre A, Kelly J, Krismer K, Nemeth S, et al. Approach to machine learning for extraction of real-world data variables from electronic health records. Front Pharmacol. 2023;14:1180962.
- 11. Jonnalagadda SR, Goyal P, Huffman MD. Automating data extraction in systematic reviews: a systematic review. Syst Rev. 2015;4(1):1–16.
- 12. Gartlehner G, Kahwati L, Hilscher R, Thomas I, Kugley S, Crotty K, et al. Data extraction for evidence synthesis using a large language model: a proof-of-concept study. Res Synth Methods. 2024;15(4):576–89.
- 13. Dodd S, Clarke M, Becker L, Mavergames C, Fish R, Williamson PR. A taxonomy has been developed for outcomes in medical research to help improve knowledge discovery. J Clin Epidemiol. 2018;96:84–92.
- 14. Yalcinkaya A, Rahbek O, Tirta M, Jepsen JF, Rathleff MS, Iobst C, et al. Outcomes and outcome measurement instruments in lower-limb lengthening surgery: a scoping review to inform core outcome set development. Acta Orthop. 2024;95:715–22.
- 15. Montani I, Honnibal M, Honnibal M, Boyd A, Landeghem S Van, Peters H. explosion/spaCy: v3.7.2: Fixes for APIs and requirements. 2023. 10.5281/zenodo.10009823
- 16. Honnibal M, Montani I, Van Landeghem S, Boyd A. spaCy: Industrial-strength natural language processing in python. 2020.
- 17. Tunstall L, Reimers N, Eun Seo Jo U, Bates L, Korat D, Wasserblat M, et al. Efficient Few-Shot Learning Without Prompts. 2022 Sep [cited 2025 May 15]. Available from: https://arxiv.org/pdf/2209.11055
- 18. Li Z, Zhang X, Zhang Y, Long D, Xie P, Zhang M, et al. Towards General Text Embeddings with Multi-stage Contrastive Learning. 2023 Aug [cited 2025 May 15]. Available from: https://arxiv.org/pdf/2308.03281
- 19. Enevoldsen K, Chung I, Kerboua I, Kardos M, Mathur A, Stap D, et al. MMTEB: Massive Multilingual Text Embedding Benchmark. 13th International Conference on Learning Representations, ICLR 2025. 2025 [cited 2025 Sep 26];102004–60.
- 20. Flourish | Data Visualization & Storytelling [Internet]. [cited 2024 Jun 30]. Available from: https://app.flourish.studio/login
- 21. Bharadwaj S, Laffin M, Hamilton A. Automating the Compilation of Potential Core-Outcomes for Clinical Trials. 2021 Jan [cited 2025 Jan 27]. Available from: https://arxiv.org/abs/2101.04076v1
- 22. Yin H, Wang Q, Zheng K, Li Z, Zhou X. Overcoming data sparsity in group recommendation. IEEE Trans Knowl Data Eng. 2020;34(7):3447–60.
- 23. Kiritchenko S, De Bruijn B, Carini S, Martin J, Sim I. ExaCT: Automatic extraction of clinical trial characteristics from journal publications. BMC Med Inform Decis Mak. 2010. 10.1186/1472-6947-10-56.
- 24. Abaho M, Bollegala D, Williamson PR, Dodd S. Assessment of contextualised representations in detecting outcome phrases in clinical trials. 2022 Feb. 10.24105/ejbi.2021.17.9.53-65
- 25. Abaho M, Bollegala D, Williamson P, Dodd S. Detect and Classify - Joint Span Detection and Classification for Health Outcomes. EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings. 2021;8709–21.
- 26. De Bruijn B, Carini S, Kiritchenko S, Martin J, Sim I. Automated information extraction of key trial design elements from clinical trial publications. AMIA Annual Symposium Proceedings. 2008; p 141.
- 27. Huang KC, Chiang IJ, Xiao F, Liao CC, Liu CCH, Wong JM. Pico element detection in medical text without metadata: are first sentences enough? J Biomed Inform. 2013;46(5):940–6.
- 28. Huang KC, Liu CCH, Yang SS, Xiao F, Wong JM, Liao CC, et al. Classification of PICO elements by text features systematically extracted from PubMed abstracts. Proceedings - 2011 IEEE International Conference on Granular Computing, GrC 2011. 2011;279–83.
- 29. Boudin F, Nie JY, Bartlett JC, Grad R, Pluye P, Dawes M. Combining classifiers for robust PICO element detection. BMC Med Inform Decis Mak. 2010;10(1):1–6.
- 30. Summerscales R, Argamon S, Hupert J, Schwartz A. Identifying Treatments, Groups, and Outcomes in Medical Abstracts. 2009.
- 31. Summerscales RL, Argamon S, Bai S, Hupert J, Schwartz A. Automatic Summarization of Results from Clinical Trials. IEEE International Conference on Bioinformatics and Biomedicine. 2011;372–7.
- 32. Brown V, Moodie M, Sultana M, Hunter KE, Byrne R, Zarnowiecki D, et al. A scoping review of outcomes commonly reported in obesity prevention interventions aiming to improve obesity-related health behaviors in children to age 5 years. Obes Rev. 2022. 10.1111/OBR.13427.
- 33. Brown V, Moodie M, Sultana M, Hunter KE, Byrne R, Seidler AL, et al. Core outcome set for early intervention trials to prevent obesity in childhood (COS-EPOCH): Agreement on “what” to measure. Int J Obes. 2022;46(10):1867–74.
- 34. Yu E, Chu X, Zhang W, Meng X, Yang Y, Ji X, et al. Large language models in medicine: applications, challenges, and future directions. Int J Med Sci. 2025;22(11):2792.