Abstract
Purpose
Imaging reports are essential for the diagnostic evaluation, treatment planning, and follow-up of patients with neuroendocrine tumors (NETs) of the gastroenteropancreatic (GEP) system. The tumor-node-metastasis (TNM) classification is a common framework for describing tumor extent and estimating prognosis. However, the traditional free-text format of imaging reports varies in structure, detail, and clarity, leading to inconsistencies and potential omissions of information critical for optimal patient management. Recent advances in large language models (LLMs) have created new opportunities for automating complex medical assessments, including the extraction of UICC and ENETS staging classifications from imaging reports. Such automation aims to improve standardization, enhance clarity, and ensure consistency, ultimately facilitating more effective multidisciplinary clinical decision-making. This study evaluates whether LLMs can infer the UICC and ENETS TNM stage of GEP-NETs from free-text PET/CT reports that contain descriptive findings only (no explicit TNM labels).
Methods
We evaluated several models, including ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash, on a physician-generated fictitious dataset of 108 PET/CT reports with expert-annotated TNM classifications according to UICC and ENETS criteria. Model performance was assessed through F1-scores, comparing LLM-generated classifications against human expert benchmarks.
Results
Among the tested models, ChatGPT-4o demonstrated the highest accuracy, achieving micro-F1 scores of 0.79, 0.99, and 0.99 for T, N, and M, respectively, according to UICC, and 0.84, 1.00, and 0.99 according to ENETS. These results indicate that LLMs have the potential to assist in oncologic staging of NETs, especially by supporting non-specialists in clinical decision-making. However, before integration into routine practice, further prospective validation and rigorous evaluation in real-world settings are necessary.
Conclusion
This study underscores the promise of LLMs in oncologic workflows while highlighting the importance of robust benchmarking and clinical validation.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12880-025-02092-3.
Keywords: Large language models, Neuroendocrine tumors, TNM staging, PET/CT, Clinical decision support
Introduction
The integration of artificial intelligence (AI) into medical workflows is transforming the landscape of clinical decision-making [1, 2]. Large language models (LLMs) have shown remarkable capabilities in processing complex medical text, raising the possibility of their use in oncologic classification [3]. Accurate tumor staging is critical for determining prognosis and guiding treatment decisions, yet it remains a highly specialized task that demands expert interpretation [4]. Given the increasing volume of medical data, AI-driven approaches have the potential to enhance efficiency and reduce variability in staging assessments [5–7]. However, ensuring that LLMs can reliably interpret and apply structured classification systems remains a key challenge [8]. While LLMs have demonstrated strong performance in tasks such as clinical summarization and information extraction, their ability to classify disease extent according to structured staging criteria remains largely untested [9, 10]. Gastroenteropancreatic neuroendocrine tumors (GEP-NETs) are a heterogeneous group of neoplasms with highly variable clinical behavior, ranging from small, well-differentiated lesions amenable to curative endoscopic resection to aggressive, poorly differentiated malignancies with diffuse metastases and poor survival outcomes [11, 12]. Their classification has historically been challenging and inconsistent since their initial description in 1907 [13]. PET/CT is integral to the management of GEP-NETs, supporting initial staging and restaging, guiding therapy selection (including peptide receptor radionuclide therapy), and monitoring treatment response through whole-body assessment of somatostatin receptor expression. GEP-NETs are staged using two established frameworks: the UICC TNM and the ENETS classifications.
While both define stage via tumor extent, nodal involvement, and distant metastasis, ENETS incorporates site-specific criteria that can diverge from UICC; both systems are used in clinical practice, so we evaluate model performance against each. Accurate tumor classification is crucial, as even minor discrepancies in interpretation can substantially influence staging, therapeutic decisions, and overall patient management. In current clinical practice, however, such information is typically embedded within unstructured free-text reports, requiring manual extraction to obtain standardized classifications.
AI approaches, particularly those leveraging LLMs, offer the potential to automate this process and improve the consistency of oncologic staging [14]. Despite rapid advances in AI for medical imaging, most prior research has centered on image-based deep learning, whereas the application of LLMs to text-based oncologic classification remains largely unexplored [15].
Materials and methods
This retrospective study was conducted under ethics approval 2024-590-S-CB from the institutional review board of the Technical University of Munich, with all methods carried out in compliance with relevant guidelines and regulations. The requirement for individual informed consent was waived by the ethics committee due to the retrospective design and the use of fully anonymized clinical data.
This study examined 108 fictitious PET/CT reports of neuroendocrine tumors, generated by two nuclear medicine physicians, with corresponding TNM assignments according to ENETS [16–20] and UICC [21]. As only fictitious data were used, the need to obtain informed consent was waived by the institutional review board of the Technical University of Munich. The reports were systematically created to closely resemble real-world clinical reports in terms of structure, terminology, and level of detail. To ensure consistency and standardization, a predefined template was used to guide the generation of reports. Explicit TNM labels were not included in these reports. The template was designed based on established radiological reporting standards and adapted to reflect common linguistic patterns observed in real clinical documentation. Because all reports were prospectively generated for this study, no clinical time period applies; all 108 template-conforming reports were included. The case distribution reflects a heterogeneous cohort with a balanced representation of different GEP-NETs. Each report was independently reviewed by at least one additional radiologist to ensure accuracy and adherence to the standardization criteria. Discrepancies in descriptions or classifications were resolved through consensus discussions among the three medical experts. Ambiguous phrasing was included to simulate real-world variability. An example PET/CT report with corresponding ground-truth TNM classification and model-generated outputs can be seen in Supplementary Table 1.
Four models were evaluated in a zero-shot setting:
ChatGPT-4o (May 2024 version, gpt-4o-2024-05-13)
DeepSeek V3
Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)
Gemini 2.0 Flash (gemini-2.0 flash experimental)
All four models received the same two standardized prompts verbatim, one for each staging system:
Accurately TNM classify the following imaging report of a neuroendocrine neoplasia after the newest version of UICC TNM classification Staging System
and
Accurately TNM classify the following imaging report of a neuroendocrine neoplasia after the newest version of the European Neuroendocrine Tumor Society (ENETS) Staging System.
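The zero-shot setup described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the two instruction strings are quoted from the text, whereas the way the report is appended to the instruction, the helper names `build_prompt` and `parse_tnm`, and the regular expression used to read labels out of a model response are all assumptions made for this example. The call to the model itself is deliberately left out.

```python
import re

# Instruction strings quoted from the Methods text; everything else is illustrative.
PROMPTS = {
    "UICC": (
        "Accurately TNM classify the following imaging report of a neuroendocrine "
        "neoplasia after the newest version of UICC TNM classification Staging System"
    ),
    "ENETS": (
        "Accurately TNM classify the following imaging report of a neuroendocrine "
        "neoplasia after the newest version of the European Neuroendocrine Tumor "
        "Society (ENETS) Staging System."
    ),
}


def build_prompt(system: str, report_text: str) -> str:
    """Zero-shot prompt: the instruction followed by the report, with no examples.

    Joining instruction and report with a blank line is an assumption.
    """
    return f"{PROMPTS[system]}\n\n{report_text.strip()}"


# Hypothetical pattern for labels such as T2, N1, M0, TX or T1a.
# It also accepts the concatenated form "T2N1M0".
TNM_PATTERN = re.compile(r"(?<![A-Za-z])([TNM])([0-4][a-c]?|X)(?![a-z])")


def parse_tnm(response: str) -> dict:
    """Return the first T, N and M label found in a free-text model response."""
    labels = {}
    for component, value in TNM_PATTERN.findall(response):
        labels.setdefault(component, f"{component}{value}")
    return labels


example_response = "Staging: T2 N1 M0 (UICC 8th edition)"
parsed = parse_tnm(example_response)
```

A parser of this kind only normalizes the answer for comparison with the ground truth; responses that yield no label for a component would still need manual review.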
The dataset size was selected to ensure a well-balanced representation of various GEP-NETs while maintaining practical feasibility. Including 108 NETs enabled a meaningful assessment of classification performance. To reduce potential bias towards more common NET localizations, efforts were made to achieve a representative distribution.
Performance was computed separately for T, N, and M for each system. Accuracy was defined as the proportion of cases where the exact TNM classification was correctly predicted compared to the ground truth. The Python packages NumPy (version 1.26.4), pandas (version 2.2.0), scikit-learn (version 1.4.0), statsmodels (version 0.14.1), matplotlib (version 3.8.2), and seaborn (version 0.13.2) were used for data analysis and visualization [22–26]. Each of the TNM components (T, N, and M) was evaluated as a multi-class classification task. Metrics were computed using a one-vs-rest approach for each category, meaning that misclassification from one class to another contributed to both a false positive and a false negative across the respective categories. Micro-averaging was applied to combine counts across all classes, while macro-averaging reflected the unweighted mean of class-specific results.
Precision was defined as the proportion of correctly predicted cases among all cases predicted for a given class, while recall represented the proportion of correctly predicted cases among all true cases for that class. The F1 score was computed as the harmonic mean of precision and recall. Macro-F1 was obtained as the unweighted average of F1 scores across all classes, giving each category equal importance, whereas Micro-F1 was derived from the pooled counts of true positives, false positives, and false negatives across all classes, thereby weighting the results according to class frequency.
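The metric definitions above can be made concrete with a short sketch. It is written in plain Python for transparency and is not the authors' analysis code; the function names and the toy labels are invented for illustration and are not study data. For single-label multi-class data, scikit-learn's `f1_score` with `average="macro"` or `average="micro"` computes the same quantities.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """One-vs-rest true-positive, false-positive and false-negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: F1 computed from counts pooled over all classes.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels, invented for illustration only.
y_true = ["T1", "T1", "T2", "T2", "T2", "T3"]
y_pred = ["T1", "T1", "T1", "T2", "T2", "T3"]
macro, micro = macro_micro_f1(y_true, y_pred)
```

In this toy example one T2 case is predicted as T1, so T1 receives a false positive and T2 a false negative. Because each report carries exactly one label per component, the pooled false-positive and false-negative counts are equal, and micro-F1 coincides with plain accuracy (here 5 of 6 correct).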
Results
Overall, 108 PET/CT reports for the staging of NETs were analyzed. The primary tumors were located in the pancreas (59.3%; n = 64), duodenum (14.8%; n = 16), and ileum (25.9%; n = 28). Figure 1A shows the workflow for LLM-based prediction of TNM classification.
Fig. 1.
(A) Schematic representation of the workflow for classifying NET UICC and ENETS TNM based on PET/CT reports. The process includes PET/CT report generation, LLM-based classification, and validation against ground truth. (B) Radar chart illustrating the hit rate for UICC TNM classification by ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash. Key findings include ChatGPT-4o's high F1 score (0.79) for T classification. (C) Heatmap depicting the differences in performance for classifying T, N, and M stage across the models used
UICC TNM classification
We evaluated the accuracy of predictions for primary tumor features (T), regional lymph node involvement (N), and distant metastasis (M) (Fig. 1B). ChatGPT-4o achieved micro-F1 scores of 0.79, 0.99, and 0.99, respectively; DeepSeek V3 scored 0.65, 0.99, and 0.99; Claude 3.5 Sonnet obtained 0.58, 0.97, and 0.99; and Gemini 2.0 Flash achieved 0.50, 0.97, and 0.98. Further details of the macro and micro F1 scores, recall, and precision for each attribute are summarized in Tables 1, 2, 3 and 4. The models performed similarly in identifying the correct N and M stage but differed markedly in assigning the correct T stage: ChatGPT-4o scored 0.29 higher in T stage classification than Gemini 2.0 Flash (Fig. 1C).
Table 1.
Overall performance of ChatGPT-4o on UICC TNM classification from PET/CT reports
| Attribute | Precision | Recall | Macro F1 | Micro F1 |
|---|---|---|---|---|
| T | 0.84 | 0.83 | 0.71 | 0.79 |
| N | 0.98 | 0.99 | 0.99 | 0.99 |
| M | 0.99 | 0.98 | 0.99 | 0.99 |
| Average | 0.94 | 0.94 | 0.90 | 0.92 |
Table 2.
Overall performance of DeepSeek V3 on UICC TNM classification from PET/CT reports
| Attribute | Precision | Recall | Macro F1 | Micro F1 |
|---|---|---|---|---|
| T | 0.65 | 0.68 | 0.55 | 0.65 |
| N | 0.99 | 0.99 | 0.99 | 0.99 |
| M | 0.99 | 0.99 | 0.99 | 0.99 |
| Average | 0.88 | 0.87 | 0.84 | 0.88 |
Table 3.
Overall performance of Claude 3.5 Sonnet on UICC TNM classification from PET/CT reports
| Attribute | Precision | Recall | Macro F1 | Micro F1 |
|---|---|---|---|---|
| T | 0.59 | 0.60 | 0.48 | 0.58 |
| N | 0.95 | 0.99 | 0.97 | 0.97 |
| M | 0.98 | 0.99 | 0.99 | 0.99 |
| Average | 0.84 | 0.86 | 0.81 | 0.85 |
Table 4.
Overall performance of Gemini 2.0 Flash on UICC TNM classification from PET/CT reports
| Attribute | Precision | Recall | Macro F1 | Micro F1 |
|---|---|---|---|---|
| T | 0.58 | 0.52 | 0.48 | 0.50 |
| N | 0.95 | 0.98 | 0.97 | 0.97 |
| M | 0.96 | 0.99 | 0.98 | 0.98 |
| Average | 0.83 | 0.83 | 0.81 | 0.82 |
Error analysis
To analyze the errors, we created a confusion matrix for each model. For example, ChatGPT-4o (Fig. 2) confused the T2 stage for the T1 stage on 10 occasions, the M1 stage for M0 in two cases, and N0 for N1 on two occasions. Confusion matrices for DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash can be seen in Suppl. Figs. 1–3.
Fig. 2.
Confusion matrices for ChatGPT-4o for the key attributes of interest. (A) UICC T stage, (B) UICC N stage and (C) UICC M stage
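A confusion matrix of the kind used for this error analysis can be tallied as in the following sketch. It is an illustration under stated assumptions rather than the authors' code: the orientation (rows for ground truth, columns for predictions) is assumed and may differ from the published figures, the helper names are hypothetical, and the labels are invented toy data.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are ground-truth labels, columns are predicted labels (assumed layout)."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def top_confusions(y_true, y_pred, n=3):
    """Most frequent (truth, prediction) pairs in which the two disagree."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels, invented for illustration only.
labels = ["T1", "T2", "T3"]
y_true = ["T1", "T2", "T2", "T2", "T3", "T3"]
y_pred = ["T1", "T1", "T1", "T2", "T3", "T2"]

matrix = confusion_matrix(y_true, y_pred, labels)
worst = top_confusions(y_true, y_pred, n=1)
```

Off-diagonal cells identify which stage pairs are mixed up most often, which is the information summarized in the text for each model.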
ENETS TNM classification
We then used the four LLMs to classify the reports according to the ENETS classification system. The accuracy of predictions for primary tumor features (T), regional lymph node involvement (N), and distant metastasis (M) was analyzed (Fig. 3). ChatGPT-4o attained micro-F1 scores of 0.84, 1.00, and 0.99, respectively. DeepSeek V3 recorded scores of 0.74, 1.00, and 0.99. Claude 3.5 Sonnet achieved 0.64, 0.89, and 0.99, while Gemini 2.0 Flash obtained scores of 0.52, 0.97, and 0.99.
Fig. 3.

Radar chart illustrating the hit rate for ENETS TNM classification by ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash
Further details of the macro and micro F1 scores, recall, and precision for each attribute are summarized in Suppl. Tables 1–4.
Error analysis
To evaluate the misclassifications made by each model under the ENETS classification, we again constructed confusion matrices for all tested models. ChatGPT-4o mistakenly identified the T2 stage as T3 on nine occasions, confused M1 with M0 in two instances, and incorrectly labeled N0 as N1 once (Fig. 4). The confusion matrices for DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash are available in Supplementary Figs. 4–6.
Fig. 4.
Confusion matrices for ChatGPT-4o for the key attributes of interest. (A) ENETS T stage, (B) ENETS N stage, (C) ENETS M stage
Discussion
This study highlights the transformative potential of large language models (LLMs) in the oncologic staging of neuroendocrine tumors (NETs), presenting an innovative solution to streamline complex diagnostic workflows. By accurately determining TNM staging from imaging reports, LLMs demonstrate their ability to interpret complex clinical narratives with high fidelity. Among the models evaluated, ChatGPT-4o achieved the highest performance, with micro-F1 scores of 0.79, 0.99, and 0.99 for T, N, and M, respectively, according to UICC, and 0.84, 1.00, and 0.99 according to ENETS. These metrics underscore the model’s capability to support clinical workflows, potentially enhancing diagnostic efficiency and consistency. The comparatively lower performance observed for the T component likely reflects its dependence on nuanced anatomic details such as local invasion depth, size thresholds, and site-specific classification rules, which are often described less explicitly in PET/CT narratives than nodal or metastatic findings. Moreover, PET/CT inherently provides limited morphologic resolution for subtle features of local tumor extension, and variations in phrasing (such as “abuts,” “contacts,” or “suspicious for invasion”) can introduce additional ambiguity for language models. Future work could address these challenges by incorporating structured reporting elements, refining prompt design to capture context-specific descriptors, or combining text-based approaches with complementary imaging-derived information.
Past investigations into AI-powered medical text classification have predominantly focused on structured EHR data [27]. Previous research based on free-text has illustrated the potential of LLMs in a range of radiological classification applications. For example, one study [28] demonstrated that the LI-RADS score can be automatically derived from radiology reports, enhancing consistency in liver lesion evaluations. In another study [29], LLMs were employed to conduct TNM classification for NSCLC by extracting key tumor characteristics from free-text CT reports and translating them into standardized staging information. Moreover, LLMs have been effectively utilized in brain tumor classification [30], where they extracted and synthesized complex diagnostic details from radiology reports to support accurate clinical decision-making. In the current study, we expand upon this foundation by thoroughly examining cutting-edge LLMs in a particularly difficult diagnostic environment. Rather than using structured data, we drew on unstructured PET/CT reports, which are often laden with intricate, ambiguous phrasing, uncertain clinical terminology, and implied rather than explicitly stated diagnoses [31].
To our knowledge, however, no published research to date has addressed ENETS/UICC classification using LLMs exclusively on PET/CT reports.
A key insight from our analysis is the substantial variation in performance across different LLMs. Some models, such as ChatGPT-4o, performed better than others, but the specific factors behind this advantage remain unclear, and other models may struggle to interpret the nuanced nature of medical text. Understanding these discrepancies is therefore crucial and underscores the importance of rigorous model-specific evaluation and fine-tuning when considering LLMs for clinical applications.
The variability in reporting styles among physicians presents an additional challenge for LLM performance. Differences in terminology, structure, and the level of detail in imaging reports can significantly affect the models’ ability to accurately interpret and classify staging information. Some clinicians may prefer concise summaries, while others provide extensive descriptive narratives, which can introduce inconsistencies that LLMs must navigate. Addressing this challenge requires models to be trained on diverse datasets that encompass a wide range of reporting styles, ensuring greater robustness and adaptability in real-world clinical settings.
Another important consideration is the potential for LLMs to reduce variability in staging decisions among clinicians with differing levels of expertise. By providing consistent and reproducible assessments, LLMs could help standardize oncologic staging practices. This consistency could lead to more uniform treatment decisions, ultimately contributing to improved patient outcomes. Additionally, the ability of LLMs to rapidly process large volumes of imaging report data may alleviate the growing workload faced by healthcare professionals, allowing them to focus on more complex decision-making tasks that require human judgment.
Despite these encouraging results, several limitations should be acknowledged. First, this study relied on a physician-generated synthetic dataset of 108 PET/CT reports. While the use of fictitious cases ensured that no patient-identifiable information was included, the artificial nature of these reports may limit generalizability to real-world clinical practice, where reporting style, terminology, and level of detail can vary substantially. Nevertheless, the use of expertly designed synthetic reports offers distinct advantages: it enables controlled experimentation across predefined staging scenarios, ensures standardized wording and balanced representation of disease patterns, and eliminates privacy and ethical constraints associated with patient data. This controlled design allows systematic assessment of model behavior under defined conditions and represents an essential preparatory step toward prospective validation on authentic clinical reports. All analyses were descriptive in nature, and no formal hypothesis testing between models was performed. Finally, while no fabricated findings were observed in this study, large language models are known to be prone to hallucinations; this remains a potential risk—particularly with unconstrained prompts or real-world data—highlighting the need for appropriate safeguards and human oversight.
Our promising results on these standardized, yet variably phrased, synthetic reports lay the groundwork for future investigations with genuine patient data. Ultimately, confirming that these models perform similarly well on real-world datasets is critical for translation into routine clinical use.
Conclusion
This study highlights the considerable potential of LLMs, particularly ChatGPT-4o, in automating the oncologic staging of NETs with high accuracy. The results suggest that LLMs can be valuable adjuncts in clinical decision-making, especially for non-specialists who may benefit from automated support in interpreting complex imaging reports. However, before LLMs can be integrated into routine clinical practice, prospective validation studies in diverse, real-world settings are essential. Additionally, addressing ethical, legal, and practical considerations will be key to their successful and responsible deployment. Ultimately, while LLMs hold great promise in enhancing oncologic workflows, their role should be viewed as complementary to, rather than a replacement for, expert clinical judgment.
Author contributions
MM and DS conceived the study, designed the experiments, curated and annotated the dataset, performed the primary analyses, interpreted the results, and drafted the manuscript. ME contributed nuclear medicine/oncologic imaging expertise, advised on study design and clinical interpretation, and critically revised the manuscript. RFB supervised imaging methodology and contributed to study conception and interpretation. LS assisted with data annotation and preprocessing, supported statistical evaluation, and contributed to manuscript editing.
Funding
Open Access funding enabled and organized by Projekt DEAL. The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
This retrospective study was approved by the Ethics Committee of the Technical University of Munich (Approval ID: 2024-590-S-CB). The requirement for individual informed consent was waived by the ethics committee due to the retrospective design and use of fully anonymized clinical data. The study was conducted in accordance with the Declaration of Helsinki and institutional guidelines.
Consent to publication
Not applicable
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Markus Mergen and Daniel Spitzl contributed equally to this work and share first authorship.
References
- 1. Khosravi M, et al. Artificial intelligence and Decision-Making in healthcare: A thematic analysis of a systematic review of reviews. Health Serv Res Manag Epidemiol. 2024;11:23333928241234863.
- 2. Niraula D, et al. Intricacies of human-AI interaction in dynamic decision-making for precision oncology. Nat Commun. 2025;16(1):1138.
- 3. Menezes MCS, et al. The potential of generative Pre-trained transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit Health. 2025;7(1):e35–43.
- 4. Nakakura EK. Challenges staging neuroendocrine tumors of the Pancreas, jejunum and Ileum, and appendix. Ann Surg Oncol. 2018;25(3):591–3.
- 5. Bi WL, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin. 2019;69(2):127–57.
- 6. Senthil Kumar K, et al. Artificial intelligence in clinical oncology: from data to digital pathology and treatment. Am Soc Clin Oncol Educ Book. 2023;43:e390084.
- 7. Rosler W, et al. An overview and a roadmap for artificial intelligence in hematology and oncology. J Cancer Res Clin Oncol. 2023;149(10):7997–8006.
- 8. Dai HJ, et al. Evaluating a natural language processing-driven, AI-assisted international classification of diseases, 10th revision, clinical modification, coding system for diagnosis related groups in a real hospital environment: algorithm development and validation study. J Med Internet Res. 2024;26:e58278.
- 9. Busch F, et al. Large language models for structured reporting in radiology: past, present, and future. Eur Radiol. 2024.
- 10. Wang L, et al. Artificial intelligence in clinical decision support systems for oncology. Int J Med Sci. 2023;20(1):79–86.
- 11. Modlin IM, et al. Gastroenteropancreatic neuroendocrine tumours. Lancet Oncol. 2008;9(1):61–72.
- 12. Pape UF, et al. Prognostic factors of long-term outcome in gastroenteropancreatic neuroendocrine tumours. Endocr Relat Cancer. 2008;15(4):1083–97.
- 13. Kloppel G, Perren A, Heitz PU. The gastroenteropancreatic neuroendocrine cell system and its tumors: the WHO classification. Ann N Y Acad Sci. 2004;1014:13–27.
- 14. Zitu MM, et al. Large Language models in cancer: potentials, risks, and safeguards. BJR Artif Intell. 2025;2(1):ubae019.
- 15. Bhinder B, et al. Artificial intelligence in cancer research and precision medicine. Cancer Discov. 2021;11(4):900–15.
- 16. O’Toole D, Kianmanesh R, Caplin M. ENETS 2016 consensus guidelines for the management of patients with digestive neuroendocrine tumors: an update. Neuroendocrinology. 2016;103(2):117–8.
- 17. Niederle B, et al. ENETS consensus guidelines update for neuroendocrine neoplasms of the jejunum and ileum. Neuroendocrinology. 2016;103(2):125–38.
- 18. Garcia-Carbonero R, et al. ENETS consensus guidelines for High-Grade gastroenteropancreatic neuroendocrine tumors and neuroendocrine carcinomas. Neuroendocrinology. 2016;103(2):186–94.
- 19. Delle Fave G, et al. ENETS consensus guidelines update for gastroduodenal neuroendocrine neoplasms. Neuroendocrinology. 2016;103(2):119–24.
- 20. Ramage JK, et al. ENETS consensus guidelines update for colorectal neuroendocrine neoplasms. Neuroendocrinology. 2016;103(2):139–43.
- 21. Bertero L, et al. Eighth edition of the UICC classification of malignant tumours: an overview of the changes in the pathological TNM classification criteria-What has changed and why? Virchows Arch. 2018;472(4):519–31.
- 22. Harris CR, van der Walt MK. Array programming with numpy. Nature. 2020;585:357–62.
- 23. McKinney W. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 2010: pp. 56–61.
- 24. Pedregosa F, Gramfort VG. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- 25. Seabold SPJ. Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 2010.
- 26. Waskom M. Seaborn: statistical data visualization. J Open Source Softw. 2021;6.
- 27. Bhattarai K, et al. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spacy’s rule-based and machine learning-based methods. JAMIA Open. 2024;7(3):ooae060.
- 28. Fervers P, et al. ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German Language. Front Radiol. 2024;4:1390774.
- 29. Lee JE, et al. Lung cancer staging using chest CT and FDG PET/CT Free-Text reports: comparison among three ChatGPT large Language models and six human readers of varying experience. AJR Am J Roentgenol. 2024;223(6):e2431696.
- 30. Kanzawa J, et al. Automated classification of brain MRI reports using fine-tuned large Language models. Neuroradiology. 2024;66(12):2177–83.
- 31. Larson DB, et al. Improving consistency in radiology reporting through the use of department-wide standardized structured reporting. Radiology. 2013;267(1):240–50.