Author manuscript; available in PMC: 2025 Jul 2.
Published in final edited form as: Laryngoscope. 2025 Jun 27;135(11):4119–4124. doi: 10.1002/lary.32368

Evaluating Resident Feedback Using a Large Language Model: Are We Missing Core Competencies?

Syed Ameen Ahmad 1,*, Maria Armache 2,*, Danielle R Trakimas 2, Jenny X Chen 2,**, Deepa Galaiya 2,**
PMCID: PMC12221221  NIHMSID: NIHMS2092534  PMID: 40574724

Abstract

Objectives:

Use a large language model (LLM) to examine the content and quality of narrative feedback provided to residents through: 1) an app collecting workplace-based assessments of surgical performance (SIMPL-OR), 2) Objective Structured Assessment of Technical Skills (OSATS), and 3) end-of-rotation (EOR) evaluations.

Methods:

Narrative feedback provided to residents at a single institution from 2017 to 2021 was examined. Sixty entries (20 of each format) were evaluated by two faculty members to determine whether they were encouraging, corrective, or specific, and whether they addressed the Core Competencies outlined by the Accreditation Council for Graduate Medical Education. ChatGPT-4o was tested on these 60 entries before evaluating the remaining 776 entries.

Results:

ChatGPT evaluated entries with 90% concordance with faculty (κ=0.94). Within the 776 feedback entries evaluated by ChatGPT, Competencies addressed included: patient care (n=491, 97% vs. 77% vs. 36% for SIMPL-OR, OSATS, EOR respectively, p<0.001), practice-based learning (n=175, 32% vs. 23% vs. 16%, p<0.001), professionalism (n=168, 1% vs. 6% vs. 40%, p<0.001), medical knowledge (n=95, 7% vs. 8% vs. 17%, p<0.001), interpersonal and communication skills (n=59, 3% vs. 3% vs. 12%, p<0.001) and systems-based practice (n=31, 4% vs. 2% vs. 5%, p=0.387). Feedback was “encouraging” in 93% of both SIMPL-OR and OSATS, as compared to 84% of EOR (p<0.001). Feedback was “corrective” in 71% of SIMPL-OR vs. 44% of OSATS vs. 24% of EOR (p<0.001), and “specific” in 97% vs. 53% vs. 15%, respectively (p<0.001).

Conclusion:

The different instruments provided feedback of differing content and quality, underscoring the importance of a multimodal feedback approach.

Level of Evidence:

N/A

Keywords: Medical Education, Resident Education, Natural Language Processing

Introduction

The Accreditation Council for Graduate Medical Education (ACGME) is responsible for setting the educational standards required to prepare resident trainees for independent practice.1 Although the ACGME mandates assessments of the Core Competencies required in all medical specialties, it is unclear which feedback tools should be used and how the data from those tools should be applied. Understanding this is critical, particularly as graduate medical education shifts towards a competency-based approach.2

In surgical subspecialties like otolaryngology-head and neck surgery (OHNS), a variety of assessment tools have been implemented at different institutions. Tools like the Objective Structured Assessment of Technical Skills (OSATS) offer insight into technical ability for a particular case or scenario,3 while end-of-rotation (EOR) surveys may offer more holistic feedback over a longer period of time.4 More recently, smartphone applications like the Society for Improving Medical Professional Learning of operating room performance (SIMPL-OR) provide an interface for efficiently assessing a resident’s performance immediately following a surgical case.5–7 Nevertheless, little is known about how the quality and content of the narrative components of these assessment tools differ. Having access to this information is important for several reasons. For example, it is not clear if the narrative feedback addresses ACGME core competencies, which is important for program directors and the clinical competency committee (CCC) to make assessments of trainee readiness. Additionally, it is unknown if the narrative comments are of high enough quality for the residents to integrate feedback into their own progression.

However, the greatest barrier to comparing feedback modalities is the large volume of data that would have to be reviewed to facilitate these comparisons—a process that would be exceptionally labor-intensive with manual coding. To tackle this type of laborious work, some researchers have turned to natural language processing (NLP) models with success.8 More recently, ChatGPT has demonstrated the ability to conduct qualitative analysis without the expertise necessary to design a specialized large language model (LLM).9 Taken together, these findings suggest that ChatGPT holds the potential to efficiently evaluate feedback entries for content and quality.

Herein, we trained ChatGPT to evaluate the quality and content of feedback for OHNS residents via 1) SIMPL-OR, 2) OSATS, and 3) EOR surveys. We aimed to demonstrate that NLP using commercially available LLMs may be used to gather information from large volumes of assessments. We secondarily aimed to determine whether different feedback formats offer disparate insights for resident assessment.

Materials and Methods

Study Setting

This retrospective study was approved by the Johns Hopkins Institutional Review Board (IRB00358100). Consecutively collected resident feedback entries from 2017–2021 were anonymized with respect to resident and faculty identifiers and included in analyses.

Feedback Formats

The SIMPL-OR application has been described in previous studies validating its use.6,7,10 SIMPL-OR questions focused on resident autonomy, performance, and case complexity following an operation. Assessments were completed within 72 hours following a procedure. OSATS were completed within seven days of a closely observed key indicator procedure, such as a tracheostomy, meant to be representative of all procedures of that type done during a rotation. EOR surveys commented holistically on a resident’s longitudinal performance while on a three-month OHNS service. EOR feedback was given in either the “strengths” or “weaknesses” section of the assessment. For the purpose of the current study, only narrative (dictated and/or written) feedback was considered.

Manual Coding

Sixty feedback entries (20 of each format) were randomly selected and graded by two blinded faculty members (authors JXC and DG) as to whether the feedback was encouraging, corrective, or specific. Faculty members were also instructed to determine whether feedback addressed the six Core Competencies of the ACGME.11 Definitions for the six competencies [patient care and procedural skills, medical knowledge, systems-based practice (SBP), practice-based learning (PBL), professionalism, and interpersonal and communication skills (ICS)] were adapted from language provided by the New England Journal of Medicine (NEJM) (Supplementary Item 1).12 Both faculty members were instructed to use these definitions while evaluating feedback. Discrepancies between faculty members were discussed and resolved.

ChatGPT

The ChatGPT-4o model and “Explore” feature were used to evaluate feedback entries in November 2024. The Explore feature allows users to customize their own unique version of ChatGPT. Using this approach, ChatGPT was provided the same prompt given to faculty members under the “Instructions” section. Additionally, a training document was uploaded to the “Knowledge” section, containing a random sample of 30 feedback entries (10 of each format) with written explanations justifying the decision-making process. These feedback entries were distinct from the 60 feedback entries evaluated by faculty members. The explanations served as a template, providing ChatGPT with structured training data for evaluating subsequent feedback entries (Supplementary Item 2). As ChatGPT tends to lose training data after evaluating a large number of feedback entries, a new Explore session was started after every 60 feedback entries, matching the number initially reviewed by faculty members. Although increasing the number of feedback entries compared between faculty evaluators and ChatGPT was considered, 60 was selected to avoid degradation in ChatGPT’s output quality. Each new session was preloaded with the same training information to ensure consistency. Before starting each session, ChatGPT’s memory was cleared to prevent prior decisions from influencing subsequent analyses. Following concordance analysis, ChatGPT analyzed the remaining feedback entries.
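For readers who wish to adapt this rubric-plus-examples workflow, the same idea could in principle be reproduced programmatically. The sketch below is illustrative only: it assumes the OpenAI Python client, an API model name of "gpt-4o", and hypothetical local files holding the rubric and annotated training examples; the study itself used the ChatGPT web interface's Explore feature, not the API.

# Minimal sketch (assumptions noted above); not the workflow used in this study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = open("instructions.txt").read()         # hypothetical file: prompt given to faculty
EXAMPLES = open("training_examples.txt").read()  # hypothetical file: annotated example entries

def classify_entries(entries, batch_size=60):
    """Classify feedback entries in batches of 60, mirroring the per-session limit above."""
    results = []
    for start in range(0, len(entries), batch_size):
        # Each batch begins from the same rubric and examples; the API is stateless
        # between calls, which plays the role of opening a fresh Explore session.
        for entry in entries[start:start + batch_size]:
            response = client.chat.completions.create(
                model="gpt-4o",  # assumed API model name
                messages=[
                    {"role": "system", "content": RUBRIC + "\n\n" + EXAMPLES},
                    {"role": "user", "content": f"Feedback entry:\n{entry}\n\n"
                     "Label it as encouraging/corrective/specific and list any "
                     "ACGME Core Competencies addressed, with brief reasoning."},
                ],
            )
            results.append(response.choices[0].message.content)
    return results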

Statistical Analysis

Statistical analysis was performed using IBM SPSS Statistics and R. The chi-squared test of independence was used to compare feedback entries between the three feedback formats (with EOR “strengths” and “weaknesses” entries combined) and between ChatGPT and faculty evaluations. Cohen’s kappa (κ > 0.80 = good agreement) and concordance rates were used to assess inter-rater reliability. Statistical significance was defined as p<0.05.
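For illustration, the two statistics described above can be computed as in the following minimal Python sketch (using scipy and scikit-learn, with made-up counts and labels); the study itself performed these analyses in IBM SPSS Statistics and R.

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Illustrative 3x2 contingency table: rows are feedback formats (SIMPL-OR, OSATS, EOR),
# columns are counts of entries that did vs. did not address a given competency.
table = np.array([
    [180, 20],   # SIMPL-OR: addressed vs. not addressed (hypothetical counts)
    [70, 30],    # OSATS
    [150, 250],  # EOR
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, p = {p:.3g}")

# Illustrative binary labels (1 = domain addressed) from two raters on the same entries.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")  # >0.80 read as good agreement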

Results

A total of 836 feedback entries met inclusion criteria for this study. For the 60 random feedback entries evaluated by faculty members (20 of each format), there was a high level of agreement between the two faculty members (concordance rate: 87%; κ=0.93). After resolving discrepancies, the Core Competencies addressed were (in order of prevalence): patient care (n=54 total entries, 100% vs. 95% vs. 75% for SIMPL-OR, OSATS, and EOR respectively, p=0.020), medical knowledge (n=11 total entries, 20% vs. 5% vs. 30%, p=0.121), PBL (n=10 total entries, 5% vs. 10% vs. 35%, p=0.024), ICS (n=8 total entries, 10% vs. 5% vs. 25%, p=0.153), professionalism (n=6 total entries, 5% vs. 5% vs. 20%, p=0.189), and SBP (n=3 total entries, 0% vs. 5% vs. 10%, p=0.349). Feedback was encouraging in 100% of SIMPL-OR and OSATS cases and 85% of EOR surveys (p=0.043). Feedback was corrective in 65% vs. 45% vs. 50% of cases (p=0.4), and specific in 85% vs. 60% vs. 40% of cases (p=0.014) (Table 1).

Table 1.

Categorization of feedback entries across feedback modalities.

Domain SIMPL-OR (%) OSATS (%) EOR (%) P Value
Human Evaluators (n=20 SIMPL-OR, n=20 OSATS, n=20 EOR)
 Encouraging 100 100 85 0.043
 Corrective 65 45 50 0.40
 Specific 85 60 40 0.014
 Patient Care and Procedural Skills 100 95 75 0.020
 Medical Knowledge 20 5 30 0.12
 SBP 0 5 10 0.35
 PBL 5 10 35 0.024
 Professionalism 5 5 20 0.19
 ICS 10 5 25 0.15
ChatGPT (n=269 SIMPL-OR, n=115 OSATS, n=392 EOR)
 Encouraging 93 93 84 <0.001
 Corrective 71 44 24* <0.001
 Specific 97* 53 15* <0.001
 Patient Care and Procedural Skills 97 77 36* <0.001
 Medical Knowledge 7 8 17 <0.001
 SBP 4 2 5 0.39
 PBL 32* 23 16 <0.001
 Professionalism 1 6 40 <0.001
 ICS 3 3 12 <0.001

EOR, end-of-rotation; OSATS, objective structured assessment of technical skills; PBL, practice-based learning; SBP, systems-based practice; ICS, interpersonal and communication skills; SIMPL-OR, society for improving medical professional learning of operating room performance.

* denotes where the ChatGPT proportion significantly differs from the human evaluator proportion for a particular domain.

In examining this same set of 60 entries, ChatGPT produced a high level of concordance with expert faculty evaluations (concordance rate: 90%; κ=0.941). Among the remaining 776 feedback entries (269 SIMPL-OR, 115 OSATS, 392 EOR) evaluated by ChatGPT, Core Competencies addressed were (in order of prevalence): patient care (n=491 total entries, 97% vs. 77% vs. 36% for SIMPL-OR, OSATS, EOR respectively, p<0.001), PBL (n=175 total entries, 32% vs. 23% vs. 16%, p<0.001), professionalism (n=168 total entries, 1% vs. 6% vs. 40%, p<0.001), medical knowledge (n=95 total entries, 7% vs. 8% vs. 17%, p<0.001), ICS (n=59 total entries, 3% vs. 3% vs. 12%, p<0.001), and SBP (n=31 total entries, 4% vs. 2% vs. 5%, p=0.387). Feedback was encouraging in 93% of SIMPL-OR and OSATS cases and 84% of EOR surveys (p<0.001). Feedback was corrective in 71% vs. 44% vs. 24% of cases (p<0.001), and specific in 97% vs. 53% vs. 15% of cases (p<0.001) (Table 1, Figure 1).

Figure 1.

Visual representation of the categorization of feedback entries across SIMPL-OR (n=269), OSATS (n=115), and EOR surveys (n=392) as assessed by ChatGPT. * denotes significant difference in the percentage of entries across the three feedback modalities.

EOR, end-of-rotation; OSATS, objective structured assessment of technical skills; PBL, practice-based learning; SBP, systems-based practice; ICS, interpersonal and communication skills; SIMPL-OR, society for improving medical professional learning of operating room performance.

When comparing the percentage of domains within a feedback modality between faculty and ChatGPT evaluations, ChatGPT tended to evaluate more SIMPL-OR feedback entries as specific (13.2% difference, p=0.035) and as including the PBL domain (145.9% difference, p=0.022) when compared to human evaluators. Additionally, ChatGPT evaluated fewer EOR surveys as corrective (70.3% difference, p=0.019), specific (90.9% difference, p=0.008), and as including the patient care domain (70.3% difference, p=0.001) when compared to human evaluators. To determine whether these were true differences or the result of sampling bias, an additional 60 random feedback entries (20 of each format) from the 776 entries evaluated by ChatGPT were collected and categorized by two blinded study team members (SAA and JXC). As before, discrepancies were discussed and resolved. This supplementary analysis revealed similarly high levels of concordance (concordance rate: 92%; κ=0.82) with the following data: SIMPL-OR specific (100% concordance), SIMPL-OR PBL (80% concordance), EOR corrective (90% concordance), EOR specific (85% concordance), and EOR patient care (100% concordance). There were no significant differences in any other SIMPL-OR or EOR comparison, nor in any OSATS comparison.

Discussion

We successfully used ChatGPT to evaluate the quality and content of narrative feedback provided to OHNS residents using three distinct modalities: SIMPL-OR, OSATS, and EOR assessments. We determined that commercially available LLMs can categorize large volumes of qualitative information given appropriate oversight from human evaluators. Additionally, the variety of competencies addressed across the three formats suggests that different modalities yield unique insights into resident learning and therefore should be used concurrently.

The comparison between ChatGPT and faculty evaluations revealed mostly non-significant differences in the percentages of domains identified, suggesting that ChatGPT functioned as intended. Notably, ChatGPT’s performance on the SIMPL-OR specific domain and the EOR corrective, specific, and patient care domains was largely consistent with pre-test expectations. That is, the observed differences likely reflect true domain-specific variation arising from sampling differences rather than inaccuracies in ChatGPT’s analysis. This is further reflected by the fact that the concordance rate for these domains was high in our supplementary analysis of 60 additional feedback entries. However, at least within this study, ChatGPT tended to overestimate the proportion of feedback entries classified under the PBL domain, suggesting these interpretations should be approached with caution.

In all, ChatGPT should not completely replace human raters but rather serve as an adjunct to assist with manual coding. For instance, ChatGPT’s outputs always include reasoning that directly references elements of the feedback entries. This allows human raters to quickly evaluate its logic, agreeing or disagreeing as needed, without requiring review of the entire feedback entry. However, we acknowledge that the use of ChatGPT in professional evaluations carries important ethical implications. LLMs cannot replace the nuanced, human-centered judgment that is essential in evaluating resident performance. That said, due to the significant time and workload constraints faculty often face, it is not always feasible to manually review large volumes of feedback. In this context, ChatGPT—although imperfect—may help identify surface-level trends that can then be reviewed more efficiently and thoughtfully by human evaluators. Moreover, ChatGPT could even be used to assess evaluators on the quality of their feedback.

Of the 776 feedback entries evaluated by ChatGPT, the distribution of domains varied significantly across the modalities for all metrics except SBP. These results further support the hypothesis that different formats of feedback emphasize distinct aspects of resident performance. Feedback entries from SIMPL-OR were likely to be encouraging, corrective, specific, and to address the patient care domain. These results align with the purpose of SIMPL-OR, which is designed to deliver timely, case-specific feedback within 72 hours of observation.5–7 Conversely, EOR surveys focused on metrics that could be assessed over time,13 such as medical knowledge, professionalism, and ICS, providing insight into global performance. Patient care/technical feedback was addressed in 100% (human evaluators) and 97% (ChatGPT) of SIMPL-OR entries, while EOR surveys covered a wider range of domains, suggesting that both modalities are needed to provide comprehensive feedback.

Specificity of feedback also correlated with the time elapsed since observation, with the most specific feedback from SIMPL-OR, less specific feedback from OSATS, and the least specific feedback from EOR surveys, consistent with previous findings.5–7 ChatGPT also found a similar pattern in feedback correctiveness, with SIMPL-OR feedback the most corrective, OSATS less corrective, and EOR feedback the least corrective.

Finally, although OSATS are intended to be similar to SIMPL-OR feedback in that both assess performance after a single surgery, the quality and content of OSATS feedback differed significantly from SIMPL-OR feedback, perhaps because OSATS feedback was collected up to seven days after the observed event. An OSATS observation is also meant to be representative of all surgical cases of that type done over the course of a rotation, meaning raters may consider global performance rather than a single observation.

Taken together, these findings highlight the complementary strengths and limitations of each modality. Given that surgical residency applicants often prefer multimodal learning approaches,14 utilizing multiple feedback formats concurrently may better equip OHNS residents for the complexities of independent practice. These different formats of feedback also provide differing levels of insight into the various Core Competencies tracked by the ACGME.

There are important limitations to discuss. First, feedback was collected at a single training program and may not be generalizable to the overall population of OHNS residents. Since the culture surrounding the kind of feedback given to residents can vary significantly between institutions, multi-institutional studies are needed to expand upon these initial findings.

Second, we examined only the narrative/qualitative feedback provided in each assessment instrument (and excluded the survey/Likert scale-based components), as the narrative component is widely reported to be the most useful for residents,15,16 while simultaneously being the more difficult component to analyze. Various other studies have opined on the strengths and weaknesses of the standardized components of different feedback instruments.17 Third, since ChatGPT can lose training data after evaluating a large number of feedback entries, opening a new Explore session for each batch of feedback entries limits efficiency. Furthermore, although each Explore session was preloaded with the same information, there was still variance between sessions. This is because the temperature, the degree of randomness ChatGPT applies when generating responses, differs slightly from response to response, leading to minor variations in decision making. Lowering this temperature is possible only through the application programming interface (API); however, this requires programmatic execution of the LLM, which may be more time-consuming and less relevant to the goals of human evaluators. Additionally, our primary focus was to use ChatGPT as an adjunct to, not a replacement for, human raters. Future studies that are more interested in ChatGPT as a replacement for human raters should investigate the use of LLMs in that context.
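As a hypothetical illustration of this point, the API exposes a temperature parameter that the web interface does not; setting it to 0 would reduce, though not necessarily eliminate, run-to-run variation. The snippet below assumes the OpenAI Python client and a "gpt-4o" API model name, neither of which was used in this study.

# Minimal sketch of temperature control via the API (not used in this study).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",   # assumed API model name
    temperature=0,    # minimize sampling randomness between runs
    messages=[{"role": "user", "content": "Classify this feedback entry: ..."}],
)
print(response.choices[0].message.content)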

Conclusion

We used an LLM to determine that the content and quality of feedback provided to OHNS residents through SIMPL-OR, OSATS, and EOR surveys differed significantly, with each modality offering unique strengths and weaknesses. A multimodal approach that incorporates all three formats concurrently may be necessary to provide the most comprehensive feedback for OHNS residents.

Supplementary Material

Supplementary Item 1

Supplementary Item 1. Prompt used for faculty members and ChatGPT to assess feedback entries. Bullet points under the practice-based learning (PBL) domain were added with guidance from faculty members (JXC and DG).

Supplementary Item 2

Supplementary Item 2. Example of a feedback entry in the training data document provided under the “Knowledge” section of ChatGPT.

Funding:

Supported in part by NIH grant R25 DC021243. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

Disclosures: The authors have nothing to disclose.

Data from this manuscript were presented as a poster at the Triological Society Annual Meeting at COSM in New Orleans, Louisiana, USA, on 5/16/25.

References

1. ACGME. Overview. Accessed January 17, 2025. https://www.acgme.org/about/overview/
2. Chen JX, Thorne MC, Galaiya D, Campisi P, Gray ST. Competency-based medical education in the United States: What the otolaryngologist needs to know. Laryngoscope Investig Otolaryngol. 2023;8(4):827–831. doi:10.1002/lio2.1095
3. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84(2):273–278. doi:10.1046/j.1365-2168.1997.02502.x
4. Tekian A, Park YS, Tilton S, et al. Competencies and Feedback on Internal Medicine Residents’ End-of-Rotation Assessments Over Time: Qualitative and Quantitative Analyses. Acad Med. 2019;94(12):1961–1969. doi:10.1097/ACM.0000000000002821
5. Chen JX, Deng F, Filimonov A, et al. Multi-institutional Study of Otolaryngology Resident Intraoperative Experiences for Key Indicator Procedures. Otolaryngol Head Neck Surg. 2022;167(2):268–273. doi:10.1177/01945998211050350
6. Chen JX, Kozin E, Bohnen J, et al. Tracking operative autonomy and performance in otolaryngology training using smartphone technology: A single institution pilot study. Laryngoscope Investig Otolaryngol. 2019;4(6):578–586. doi:10.1002/lio2.323
7. Chen JX, Kozin E, Bohnen J, et al. Assessments of Otolaryngology Resident Operative Experiences Using Mobile Technology: A Pilot Study. Otolaryngol Head Neck Surg. 2019;161(6):939–945. doi:10.1177/0194599819868165
8. Lee RY, Kross EK, Torrence J, et al. Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome. JAMA Netw Open. 2023;6(3):e231204. doi:10.1001/jamanetworkopen.2023.1204
9. Morgan DL. Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. Int J Qual Methods. 2023;22:16094069231211248. doi:10.1177/16094069231211248
10. Chen JX, Miller LE, Filimonov A, et al. Factors affecting operative autonomy and performance during otolaryngology training: A multicenter trial. Laryngoscope Investig Otolaryngol. 2022;7(2):404. doi:10.1002/lio2.750
11. ACGME Common Program Requirements (Residency). Published online July 1, 2023. Accessed August 29, 2024. https://www.acgme.org/programs-and-institutions/programs/common-program-requirements/
12. Exploring the ACGME Core Competencies: Patient Care and Procedural Skills (Part 3 of 7). Published online September 8, 2016. https://knowledgeplus.nejm.org/blog/patient-care-procedural-skills
13. Ahle SL, Eskender M, Schuller M, et al. The Quality of Operative Performance Narrative Feedback: A Retrospective Data Comparison Between End of Rotation Evaluations and Workplace-based Assessments. Ann Surg. 2022;275(3):617. doi:10.1097/SLA.0000000000003907
14. Kim RH, Gilbert T. Learning style preferences of surgical residency applicants. J Surg Res. 2015;198(1):61–65. doi:10.1016/j.jss.2015.05.021
15. Figg B, Bolen O, Wagner MJ, Hicks T, Santen S. Resident Physician Interactions and Engagement With Written Assessments of Performance. Fam Med. 2023;55(2):103–106. doi:10.22454/FamMed.2022.587346
16. Ginsburg S, van der Vleuten CP, Eva KW, Lingard L. Cracking the code: residents’ interpretations of written assessment comments. Med Educ. 2017;51(4):401–410. doi:10.1111/medu.13158
17. Natesan S, Jordan J, Sheng A, et al. Feedback in Medical Education: An Evidence-based Guide to Best Practices from the Council of Residency Directors in Emergency Medicine. West J Emerg Med. 2023;24(3):479–494. doi:10.5811/westjem.56544
