Abstract
Objective
This study evaluates the feasibility of employing DeepSeek-R1 for automated scoring in examinations for radiology residents, comparing its performance with that of radiologists.
Methods
A cross-sectional study was undertaken to assess 504 diagnostic radiology reports produced by eighteen third-year radiology residents. The reports were independently evaluated by Radiologist A, Radiologist B, and DeepSeek-R1 (accessed June 15, 2025) using standardized scoring rubrics and predefined evaluation criteria. One month after the initial evaluation, DeepSeek-R1 and Radiologist A re-assessed the reports. Inter-rater reliability among Radiologist A, Radiologist B, and DeepSeek-R1, as well as test-retest reliability, was analyzed using intraclass correlation coefficients (ICC).
Results
The ICCs between DeepSeek-R1 and Radiologist A, DeepSeek-R1 and Radiologist B, and Radiologist A and Radiologist B were 0.879, 0.820, and 0.862, respectively. The test-retest ICC was 0.922 for DeepSeek-R1 and 0.952 for Radiologist A. The ICC between the DeepSeek-R1 re-test and the Radiologist A re-test was 0.885.
Conclusion
The performance of DeepSeek-R1 was comparable to that of radiologists in the evaluation of radiology residents' reports. The integration of DeepSeek-R1 into medical education could effectively assist in assessment tasks, potentially alleviating faculty workload while preserving the quality of evaluations.
Keywords: Artificial intelligence, DeepSeek-R1, Residents, Evaluation
Introduction
Recently, large language models (LLMs) have made significant advances in natural language processing (NLP) and have become a central focus of artificial intelligence (AI) research [1]. Trained on extensive textual corpora, these models demonstrate exceptional proficiency in language comprehension and generation, enabling a wide array of applications such as question answering, text synthesis, and conversational AI systems [2]. A significant milestone in this development was the 2022 release of ChatGPT, which demonstrated unprecedented capabilities in human-AI interaction through large-scale deep learning [2]. Subsequently, leading Chinese technology companies, including Baidu (WenXinYiYan), Alibaba (TongYiQianWen), and ByteDance (DouBao), have made significant contributions to LLM innovation. Notably, DeepSeek-R1, released in 2025 by the Chinese AI company DeepSeek, has distinguished itself through state-of-the-art performance in mathematical reasoning, code generation, and context-aware dialogue [3–6].

Beyond their general-purpose applications, LLMs exhibit exceptional adaptability within specialized domains. In the healthcare sector, these models have been used to disseminate health information [7, 8], assist in medical diagnosis [1, 9, 10], inform clinical decision-making [11], and analyze and classify textual data [3–6, 12, 13]. In education, LLMs have demonstrated strong performance [14, 15], contributed to personalized learning [16], and fostered pedagogical innovation [12, 13], thereby enhancing student engagement and knowledge retention. Recent studies underscore their potential in medical education, particularly in enhancing diagnostic accuracy [4] and clinical reasoning skills [1]. These advances suggest promising opportunities for the broader integration of AI in healthcare training, offering transformative potential for the future of medical education.

While most previous research has focused on human evaluation of LLM-generated information [1, 4, 9, 17], relatively few studies have explored whether LLMs can directly evaluate human-generated text. We hypothesized that DeepSeek-R1 would demonstrate reliability comparable to that of human raters in evaluating radiology residents’ written reports.
Subjects and methods
Subjects
This cross-sectional study was conducted at the Fifth Affiliated Hospital of Sun Yat-sen University in Zhuhai, China, and focused on radiology residents. These residents had completed either a five-year medical education program or a five-year medical program plus three years of basic postgraduate education, and they were required to undergo an additional three years of standardized clinical specialty training before practicing radiology independently. The inclusion criteria were as follows: (1) participants were officially registered radiology residents; and (2) they had completed their residency training in full accordance with the institutional curriculum and were preparing for the upcoming completion examination. Following a systematic screening process, a cohort of eighteen residents (eleven males and seven females, aged 26–27 years) was enrolled. As the study did not involve human biological materials, the Institutional Review Board of the Fifth Affiliated Hospital of Sun Yat-sen University granted an exemption from ethics approval.
Methods
The researchers selected disease types according to the Content and Standards for Standardized Training of Radiology Residents (2023 Edition). Each case included key information such as medical history, symptoms, physical signs, and laboratory findings, along with high-resolution X-ray images. A total of twenty-eight cases were drawn from the department’s self-built case database, with all patient identifiers removed. The materials were compiled in a unified format and presented via PowerPoint. Residents were asked to generate diagnostic reports. In accordance with the requirements of the residency training completion examination, each report comprised two components: (1) a localization diagnosis, specifying the anatomical site of the lesion; and (2) a qualitative diagnosis, determining the pathological nature of the abnormality. The residents’ reports were written on paper and subsequently transcribed into electronic versions by the researchers (Xiaobin Liu and Lanfang Huang).

Standardized reference reports were jointly established by two radiologists (Radiologist A, Guojie Wang, Associate Chief Physician with 17 years of experience; and Radiologist B, Yingqin Li, Associate Chief Physician with 23 years of experience). The scoring criteria were based on the 2023 Guangdong Provincial Residency Training Completion Examination Standards for Radiology; details are shown in Table 1. Each report had a maximum score of 20 points.

Radiologist C (Shanshan Niu, Attending Physician with 11 years of experience) uploaded the scoring criteria, standard reports, and resident diagnostic reports as separate attachments to DeepSeek-R1 (June 15, 2025). Pilot scoring was conducted by entering prompts into the chat dialogue box with the Deep-Thinking mode enabled and web search disabled. The content and format of the prompt were iteratively refined on the basis of the pilot scoring results until the scores it generated closely approximated the radiologists’ scores, as measured by the intraclass correlation coefficients (ICC). The final optimized prompt was as follows: “You are an examiner for the radiology residency training completion examination. Your task is to first study and understand the scoring criteria, then reference the standard reports to sequentially score the resident diagnostic reports, pointing out specific issues in each response. Each case is worth twenty points: ten for localization diagnosis and ten for qualitative diagnosis, with a total maximum score of 560 points. Note that you need to understand certain disease concepts and classifications. For example, malignant bone tumors include osteosarcoma, Ewing’s sarcoma, chondrosarcoma, etc. If the standard report is osteosarcoma, but the resident diagnoses a malignant tumor or Ewing’s sarcoma, partial credit may be awarded. For the localization part, approximate conceptual matches should receive appropriate credit. If a resident’s report includes a qualitative result (e.g., in Case 3, mentioning ‘metastatic tumor’), award 8–10 points for the qualitative section. The final scores should be presented in a table format, listing the localization score, qualitative score, and total score for each case.”
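Although scoring in this study was carried out through the DeepSeek-R1 chat interface, the same prompt-plus-attachments workflow can be expressed programmatically. The sketch below is illustrative only and was not part of the study: it assumes access to DeepSeek's OpenAI-compatible API, and the endpoint, model identifier, and helper function are assumptions rather than details reported here.

```python
# Illustrative sketch only: the study used the DeepSeek-R1 web chat
# (Deep-Thinking mode, web search off), not an API. The endpoint and
# model name below are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder credential
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
)

# The study's final optimized prompt, abridged here for brevity.
EXAMINER_PROMPT = (
    "You are an examiner for the radiology residency training completion "
    "examination. Study the scoring criteria, then reference the standard "
    "reports to sequentially score the resident diagnostic reports ..."
)

def score_reports(criteria: str, standard_reports: str, resident_reports: str) -> str:
    """Hypothetical helper: send the rubric, standard reports, and resident
    reports in one fresh request and return the model's score table as text."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed identifier for DeepSeek-R1
        messages=[
            {"role": "system", "content": EXAMINER_PROMPT},
            {"role": "user", "content": (
                "Scoring criteria:\n" + criteria
                + "\n\nStandard reports:\n" + standard_reports
                + "\n\nResident diagnostic reports:\n" + resident_reports
            )},
        ],
    )
    return response.choices[0].message.content
```

Issuing each scoring run as a separate, stateless request mirrors the study's practice of opening a new chat session for every formal scoring round.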
Table 1.
Scoring criteria for diagnostic radiology reports
| Scoring Dimension | Scoring Criteria | Scoring Range | Scoring Explanation |
|---|---|---|---|
| Localization Diagnosis | Complete anatomical location description is correct | 10 points | Full accuracy in describing anatomical location |
| | Partially correct (deviation does not affect clinical treatment strategy) | 1–9 points | Scored according to proximity to correct location |
| | Left-right orientation error | 0 points | Incorrect left/right designation |
| Qualitative Diagnosis | Complete pathological nature description is correct | 10 points | Full accuracy in describing pathological nature |
| | Partially correct (deviation within reasonable range) | 1–9 points | Scored according to proximity to correct diagnosis |
| | Complete pathological nature error | 0 points | Incorrect pathological nature judgment |
| Special Circumstances | Missed diagnosis of main lesions | 0 points | Failure to answer question |
| | Misspelled words | Deduct 2 points | Incorrect characters |
| | Incomplete diagnostic terminology | Deduct 1–3 points | Missing key qualifiers (e.g., “anterior rib”) |
| | Incorrect tumor type | Maximum of 5 points deducted | Error in tumor classification |
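As an illustration of how the Table 1 rubric could be handled deterministically alongside the model's judgment, the following sketch encodes the special-circumstance deductions as plain data. It is a hypothetical aid under assumed names, not part of the study's workflow, in which the rubric was conveyed to DeepSeek-R1 through the prompt text.

```python
# Hypothetical encoding of Table 1's "special circumstances" deductions;
# the study applied the rubric via the prompt, not via code.
from dataclasses import dataclass

MAX_PER_DIMENSION = 10  # localization and qualitative are each worth 10 points

@dataclass
class SpecialCircumstances:
    missed_main_lesion: bool = False   # missed main lesion: dimension scores 0
    misspelled_words: bool = False     # misspelled words: deduct 2 points
    missing_qualifiers: int = 0        # incomplete terminology: deduct 1-3 points
    wrong_tumor_type: bool = False     # incorrect tumor type: deduct up to 5 points

def apply_rubric(raw_score: int, flags: SpecialCircumstances) -> int:
    """Clamp a 0-10 dimension score and apply the Table 1 deductions."""
    if flags.missed_main_lesion:
        return 0
    score = min(max(raw_score, 0), MAX_PER_DIMENSION)
    if flags.misspelled_words:
        score -= 2
    score -= min(max(flags.missing_qualifiers, 0), 3)
    if flags.wrong_tumor_type:
        score -= 5  # maximum deduction applied for simplicity
    return max(score, 0)

# Example: a correct diagnosis written with a misspelling scores 8/10.
print(apply_rubric(10, SpecialCircumstances(misspelled_words=True)))  # 8
```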
For each formal scoring session with DeepSeek-R1, a new chat session was initiated to avoid potential influence from previous ratings. After anonymization, Radiologist A and Radiologist B independently scored the 504 reports according to the scoring criteria. Both radiologists were blinded to the scores assigned by DeepSeek-R1. One month later, Radiologist C used DeepSeek-R1 (July 15, 2025) to re-score the reports (Re-test), and Radiologist A re-scored them at the same time. The detailed study flowchart is presented in Fig. 1.
Fig. 1.
Study flowchart
Data analysis
Statistical analysis was performed using SPSS 26.0 (IBM SPSS Statistics 26.0, IBM Corp.). The ICC was used to assess the consistency of ratings among DeepSeek-R1, Radiologist A, and Radiologist B, as well as the test-retest reliability of DeepSeek-R1 and Radiologist A across the two time points. ICCs were calculated using a two-way random-effects model based on absolute agreement, and single-measure values are reported with their 95% confidence intervals (95% CI). An ICC of 0.75 or greater was considered to indicate excellent reliability, 0.60–0.74 good, 0.41–0.59 fair, and less than 0.40 poor [18]. A p-value of < 0.05 was considered statistically significant.
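The analysis was carried out in SPSS; as a minimal sketch of the same pairwise ICC(2,1) computation (two-way random effects, absolute agreement, single measure) in Python, the following uses the pingouin package on toy long-format data. The column names and example scores are assumptions for illustration only.

```python
# Minimal sketch of the pairwise ICC(2,1) analysis; the study used SPSS 26.0.
# Toy data only: the real analysis covered 504 reports per rater.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "report": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":  ["DeepSeek-R1", "Radiologist A", "Radiologist B"] * 4,
    "score":  [17, 20, 20, 16, 18, 20, 19, 19, 18, 14, 15, 13],
})

def pairwise_icc(df: pd.DataFrame, rater_a: str, rater_b: str) -> pd.Series:
    """ICC between two raters: two-way random effects, absolute agreement,
    single measure (the 'ICC2' row in pingouin's output)."""
    subset = df[df["rater"].isin([rater_a, rater_b])]
    icc = pg.intraclass_corr(data=subset, targets="report",
                             raters="rater", ratings="score")
    return icc.set_index("Type").loc["ICC2", ["ICC", "CI95%", "pval"]]

print(pairwise_icc(ratings, "DeepSeek-R1", "Radiologist A"))
```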
Results
The scores assigned by DeepSeek-R1 to one resident’s reports are presented in Table 2. A comparison of the scores given by DeepSeek-R1 and the radiologists for the same reports is shown in Table 3. In the initial assessment, the ICCs among DeepSeek-R1, Radiologist A, and Radiologist B all exceeded 0.75, indicating excellent reliability. Both DeepSeek-R1 and Radiologist A also demonstrated strong test-retest reliability, with ICCs exceeding 0.90 between their first and second assessments. Detailed results are provided in Table 4.
Table 2.
An example of the scoring by DeepSeek-R1 for a resident
| Question Number | Standard Reports | Resident’s Reports | Localization Score | Qualitative Score | Total Score |
|---|---|---|---|---|---|
| 1 | Secondary tuberculosis of the left upper lobe | Secondary tuberculosis of left upper lung field | 10 | 10 | 20 |
| 2 | Central lung cancer of right upper lobe with obstructive atelectasis | Central lung cancer of right upper lobe with atelectasis | 10 | 9 | 19 |
| 3 | Bilateral metastatic tumors in both lungs | Metastatic tumor in left middle lung field | 0 | 10 | 10 |
| 4 | Right-sided pneumothorax (lung tissue compressed by about 30–40%) | Right-sided pneumothorax (right lung compressed by about 30–40%) | 10 | 10 | 20 |
| 5 | Small pleural effusion and segmental atelectasis of the right lower lobe | Lung abscess in the lateral segment of the right middle lobe | 0 | 0 | 0 |
| 6 | Old fracture of the left 5th anterior rib | Normal chest x-ray | 0 | 0 | 0 |
| 7 | Cancer of the thoracic lower segment of the esophagus | Cancer of the thoracic middle and lower esophagus | 8 | 10 | 18 |
| 8 | Osteosarcoma of left distal femur | Osteosarcoma of left distal femur | 10 | 10 | 20 |
| 9 | Achalasia of the cardia | Achalasia of the cardia | 10 | 8 | 18 |
| 10 | Bronchial pneumonia in both lungs | Bronchial pneumonia in right lung | 5 | 9 | 14 |
| 11 | Duodenal bulb ulcer | Duodenal bulb ulcer | 10 | 10 | 20 |
| 12 | Gallstones | Gallstones | 10 | 10 | 20 |
| 13 | Pulmonary edema | Pulmonary edema with inflammation, cardiomegaly | 10 | 8 | 18 |
| 14 | Ewing’s sarcoma of the right humerus | Osteosarcoma of right mid humerus | 9 | 5 | 14 |
| 15 | Gastrointestinal perforation | Gastrointestinal perforation | 10 | 10 | 20 |
| 16 | Low small bowel obstruction | Low incomplete small bowel obstruction | 10 | 9 | 19 |
| 17 | Staghorn calculus in the right kidney | Multiple stones in right kidney, enlarged right renal shadow | 8 | 7 | 15 |
| 18 | Colles’ fracture of the right radius with ulnar styloid fracture | Distal radius fracture on the right, ulnar styloid avulsion fracture (Smith’s fracture) | 0 | 5 | 5 |
| 19 | Lobar pneumonia of the right lower lobe | Lobar pneumonia of the right lower lobe, posterior basal segment | 9 | 10 | 19 |
| 20 | Enchondroma of the proximal phalanx of the right fifth finger | Giant cell tumor of proximal phalanx of right little finger | 9 | 0 | 9 |
| 21 | Smith’s fracture of the left distal radius with ulnar styloid fracture | Smith’s fracture of left distal radius, avulsion fracture of ulna | 10 | 5 | 15 |
| 22 | Dislocation of right shoulder joint | Dislocation of right shoulder joint | 10 | 10 | 20 |
| 23 | Dislocation of right elbow joint | Dislocation of right elbow joint | 10 | 10 | 20 |
| 24 | Esophageal varices of middle and lower segments (or esophagogastric varices) | Esophageal varices of middle and lower segments | 10 | 10 | 20 |
| 25 | Atlantoaxial subluxation | Atlast-odontoid subluxation | 8 | 9 | 17 |
| 26 | Spondylolysis and grade I anterior spondylolisthesis of L5 | Bilateral spondylolysis of L5 with grade I spondylolisthesis of L5, S1 spina bifida occulta | 9 | 8 | 17 |
| 27 | Osteochondroma of left distal femur | Osteochondroma of left distal femur | 10 | 10 | 20 |
| 28 | Multiple metastatic tumors in the lumbar spine and pelvis | Metastatic tumor in lumbosacral vertebrae | 7 | 9 | 16 |
Question 26: L Lumbar, S Sacrum
Table 3.
DeepSeek-R1 and radiologists scoring the same reports
| Question Number | Scorer | Standard Reports | Resident’s Reports | Localization Score | Qualitative Score | Total Score |
|---|---|---|---|---|---|---|
| 25 | DeepSeek-R1 | Atlantoaxial subluxation | Atlanto-odontoid joint subluxation | 8 | 9 | 17 |
| 25 | Radiologist A | Atlantoaxial subluxation | Atlanto-odontoid joint subluxation | 10 | 10 | 20 |
| 25 | Radiologist B | Atlantoaxial subluxation | Atlanto-odontoid joint subluxation | 10 | 10 | 20 |
| 28 | DeepSeek-R1 | Multiple metastatic tumors in the lumbar spine and pelvic bones | Metastatic tumors in the lumbosacral vertebrae | 7 | 9 | 16 |
| 28 | Radiologist A | Multiple metastatic tumors in the lumbar spine and pelvic bones | Metastatic tumors in the lumbosacral vertebrae | 8 | 10 | 18 |
| 28 | Radiologist B | Multiple metastatic tumors in the lumbar spine and pelvic bones | Metastatic tumors in the lumbosacral vertebrae | 10 | 10 | 20 |
Table 4.
Intraclass correlation coefficients (ICC) of scoring between DeepSeek-R1 and radiologists
| Group | ICC | CI 95% Lower | CI 95% Upper | P value |
|---|---|---|---|---|
| DeepSeek-R1 and Radiologist A | 0.879 | 0.853 | 0.901 | <0.001 |
| DeepSeek-R1 and Radiologist B | 0.820 | 0.786 | 0.849 | <0.001 |
| Radiologist A and Radiologist B | 0.862 | 0.835 | 0.885 | <0.001 |
| DeepSeek-R1 (Test) and DeepSeek-R1 (Re-test) | 0.922 | 0.907 | 0.934 | <0.001 |
| Radiologist A (Test) and Radiologist A (Re-test) | 0.952 | 0.942 | 0.960 | <0.001 |
| DeepSeek-R1 (Re-test) and Radiologist A (Re-test) | 0.885 | 0.861 | 0.904 | <0.001 |
CI Confidence interval
Discussion
Main findings
This study employed DeepSeek-R1 to automate the evaluation of written examinations within a radiology residency training program. An analysis of 504 X-ray diagnostic reports revealed excellent agreement between the scores generated by DeepSeek-R1 and those assigned by radiologists. In the test-retest reliability assessment, DeepSeek-R1 achieved an ICC of 0.922 between its two scoring sessions, while Radiologist A showed an ICC of 0.952. The slightly higher consistency for Radiologist A might be attributed to the relatively short interval between ratings, potentially introducing recall bias from the initial assessment.
Comparison with prior studies
Büyüktoka and colleagues developed a transformer-based BERT model tailored for local LLM applications, specifically for scoring radiology examination requisitions. Their study demonstrated a high level of concordance between the scores assigned by the automated system and those given by radiologists [19]. Such a localized LLM has the potential to be integrated into radiology workflows, thereby enhancing operational efficiency. In a similar vein, Seneviratne and collaborators employed GPT-4 to evaluate short-answer questions in a systems pharmacology course, observing a strong correlation between the scores generated by the LLM and those assigned by independent human raters [20]. Furthermore, Runyon illustrated that, with meticulously crafted prompts and refined methodologies, LLMs can reliably assess objective structured clinical examination (OSCE) patient notes using analytical rubrics [21]. Our findings align with these studies. Moreover, such LLM-based assessment methods can enhance evaluation efficiency across various domains. In educational contexts, they not only reduce the grading workload for instructors but also provide detailed, personalized feedback that identifies individual learning gaps, thereby offering more targeted guidance to students [20].
Limitations
This study has several limitations. First, the experts who established the standard reports were also involved in the scoring process, which may have introduced bias. Second, the resident reports were originally produced on paper, and their transcription into electronic versions by the researchers may not fully capture certain aspects of the originals, such as spelling errors. Third, the one-month interval between the two scoring sessions was short, so the radiologists’ second assessments may have been influenced by their initial scores. Fourth, the study included only X-ray reports and did not analyze CT or MRI reports. Additionally, while we analyzed the agreement between DeepSeek-R1’s and the radiologists’ scores, we neither conducted an in-depth analysis of discordant ratings nor compared the efficiency of DeepSeek-R1 and the radiologists in the scoring process.
Future research should ensure that developers of standard reports and scorers are independent. Residents should generate reports directly in electronic format. Studies with longer intervals, larger sample sizes, and more diverse report types should investigate LLMs’ performance in realistic assessment environments and whether LLMs can improve radiologists’ efficiency in evaluation tasks.
Implications for education and practice
DeepSeek-R1 demonstrated comparable accuracy to radiologists in scoring radiology diagnostic reports, confirming that LLMs can effectively evaluate human-generated text. These findings underscore the potential of AI-driven tools to enhance educational efficiency by providing consistent and scalable scoring solutions, thereby enabling educators to devote more time to teaching and providing personalized feedback. Future research should investigate the generalizability of DeepSeek-R1 across various imaging modalities (e.g., CT and MRI), multi-institutional contexts, and extended assessment periods to further validate its reliability and practicality in real-world training environments.
Authors’ contributions
Conceptualization, Guojie Wang; methodology, Guojie Wang and Shanshan Niu; software, Guojie Wang and Shanshan Niu; formal analysis, Shanshan Niu; investigation, Shanshan Niu and Guojie Wang; resources, Guojie Wang and Xiaobin Liu; data curation, Shanshan Niu, Lanfang Huang and Xiaobin Liu; writing—original draft preparation, Shanshan Niu and Lanfang Huang; writing—review and editing, Guojie Wang and Yingqin Li; visualization, Guojie Wang and Yingqin Li; supervision, Guojie Wang; project administration, Guojie Wang and Shanshan Niu. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Sun Yat-sen University Undergraduate Teaching Quality Project (2024).
Data availability
The data are available from the corresponding author upon reasonable request.
Declarations
Ethics approval and consent to participate
Since the study did not encompass research involving human biological materials, it was granted an exemption from ethics approval by the Institutional Review Board of the Fifth Affiliated Hospital of Sun Yat-sen University.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Shanshan Niu, Xiaobin Liu and Lanfang Huang contributed equally to the work.
Contributor Information
Yingqin Li, Email: liyqin@mail.sysu.edu.cn.
Guojie Wang, Email: wanggj5@mail.sysu.edu.cn.
References
1. Kaygisiz ÖF, Teke MT. Can DeepSeek and ChatGPT be used in the diagnosis of oral pathologies? BMC Oral Health. 2025;25(1):638. 10.1186/s12903-025-06034-x.
2. Haupt CE, Marks M. AI-generated medical advice-GPT and beyond. JAMA. 2023;329(16):1349–50. 10.1001/jama.2023.5321.
3. Temsah A, Alhasan K, Altamimi I, et al. DeepSeek in healthcare: revealing opportunities and steering challenges of a new open-source artificial intelligence frontier. Cureus. 2025;17(2):e79221. 10.7759/cureus.79221.
4. Zhou M, Pan Y, Zhang Y, et al. Evaluating AI-generated patient education materials for spinal surgeries: comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int J Med Inf. 2025;198:105871. 10.1016/j.ijmedinf.2025.105871.
5. Sandmann S, Hegselmann S, Fujarski M, et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat Med. 2025;31(8):2546–9. 10.1038/s41591-025-03727-2.
6. Tordjman M, Liu Z, Yuce M, et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. 2025;31(8):2550–5. 10.1038/s41591-025-03726-3.
7. Özcivelek T, Özcan B. Comparative evaluation of responses from DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and dental GPT chatbots to patient inquiries about dental and maxillofacial prostheses. BMC Oral Health. 2025;25(1):871. 10.1186/s12903-025-06267-w.
8. Dincer HA, Dogu D. Evaluating artificial intelligence in patient education: DeepSeek-V3 versus ChatGPT-4o in answering common questions on laparoscopic cholecystectomy. ANZ J Surg. 2025;11. 10.1111/ans.70198.
9. Wu X, Huang Y, He Q. A large language model improves clinicians’ diagnostic performance in complex critical illness cases. Crit Care. 2025;29(1):230. 10.1186/s13054-025-05468-7.
10. Chan L, Xu X, Lv K. DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study. Int J Surg. 2025;111(6):4056–9. 10.1097/JS9.000000000000238.
11. Luo PW, Liu JW, Xie X, et al. DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages. Am J Clin Exp Urol. 2025;13(2):176–85. 10.62347/UIAP7979.
12. Spitzl D, Mergen M, Bauer U, et al. Leveraging large language models for accurate classification of liver lesions from MRI reports. Comput Struct Biotechnol J. 2025;27:2139–46. 10.1016/j.csbj.2025.05.019.
13. Zhang J, Liu J, Guo M, et al. DeepSeek-assisted LI-RADS classification: AI-driven precision in hepatocellular carcinoma diagnosis. Int J Surg. 2025;111(9):5970–9. 10.1097/JS9.0000000000002763.
14. Xu P, Wu Y, Jin K, et al. DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning. Adv Ophthalmol Pract Res. 2025;5(3):189–95. 10.1016/j.aopr.2025.05.001.
15. Ramlogan S, Raman V, Ramlogan S. A pilot study of the performance of ChatGPT and other large language models on a written final year periodontology exam. BMC Med Educ. 2025;25(1):727. 10.1186/s12909-025-07195-7.
16. Temizsoy Korkmaz F, Ok F, Karip B, et al. A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection. BMC Med Educ. 2025;25(1):903. 10.1186/s12909-025-07493-0.
17. Prasad S, Langlie J, Pasick L, et al. Evaluating advanced AI reasoning models: ChatGPT-4.0 and DeepSeek-R1 diagnostic performance in otolaryngology: a comparative analysis. Am J Otolaryngol. 2025;46(4):104667. 10.1016/j.amjoto.2025.104667.
18. Hessam S, Scholl L, Sand M, et al. A novel severity assessment scoring system for hidradenitis suppurativa. JAMA Dermatol. 2018;154(3):330–5. 10.1001/jamadermatol.2017.5890.
19. Büyüktoka RE, Surucu M, Erekli Derinkaya PB, et al. Applying large language model for automated quality scoring of radiology requisitions using a standardized criteria. Eur Radiol. 2025;20. 10.1007/s00330-025-11933-2.
20. Seneviratne HMTW, Manathunga SS. Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings. BMC Med Educ. 2025;25(1):1146. 10.1186/s12909-025-07718-2.
21. Runyon C. Using large language models (LLMs) to apply analytic rubrics to score post-encounter notes. Med Teach. 2025:1–9. 10.1080/0142159X.2025.2504106.