Author manuscript; available in PMC: 2025 Mar 1.
Published in final edited form as: Acad Med. 2023 Nov 17;99(3):285–289. doi: 10.1097/ACM.0000000000005527

Making Use of Natural Language Processing to Better Understand Medical Students’ Self-Assessment of Clinical Skills

Laurah Turner 1, Danielle E Weber 2, Sally A Santen 3, Amy L Olex 4, Pamela Baker 5, Seth Overla 6, David Shu 7, Matt Kelleher 8
PMCID: PMC10922291  NIHMSID: NIHMS1940037  PMID: 37976396

Abstract

Problem

Reflective practice is necessary for self-regulated learning. Helping medical students develop these skills can be challenging since they are difficult to observe. One common solution is to assign students reflective self-assessments, which produce large quantities of narrative assessment data. Reflective self-assessments also provide feedback to faculty regarding students’ understanding of content, reflective abilities, and areas for course improvement. To maximize student learning and feedback to faculty, reflective self-assessments must be reviewed and analyzed, activities that are often difficult for faculty due to the time-intensive and cumbersome nature of processing large quantities of narrative assessment data.

Approach

The authors collected narrative assessment data (2,224 reflective self-assessments) from 344 medical students. In academic years 2019–2020 and 2021–2022, students at the University of Cincinnati College of Medicine responded to 2 prompts (aspects that surprised students; areas for student improvement) after reviewing their standardized patient encounters. These free-text entries were analyzed using TopEx, an open-source natural language processing (NLP) tool, to identify common topics and themes, which faculty then reviewed.

Outcomes

TopEx expedited theme identification in students’ reflective self-assessments, unveiling 10 themes for prompt 1, such as question organization and history analysis, and 8 for prompt 2, including sensitive histories and exam efficiency. Using TopEx offered a user-friendly, time-saving analysis method without requiring complex NLP implementations. The authors discerned 4 education enhancement implications: aggregating themes for future student reflection; revising self-assessments for common improvement areas; adjusting curriculum to guide students better; and aiding faculty in providing targeted future feedback.

Next Steps

University of Cincinnati College of Medicine aims to refine and expand the utilization of TopEx for deeper narrative assessment analysis, while other institutions may model or extend this approach to uncover broader educational insights and drive curricular advancements.

Problem

Reflective practice helps promote self-regulated learning through identifying knowledge or performance gaps and improvement opportunities,1 crucial skills for practicing physicians.2 Making reflective practice habitual is important starting in medical school; therefore, exercises requiring student reflective self-assessment are common in medical education.

Students’ reflective self-assessments can also provide valuable information to faculty in guiding program evaluation, identifying opportunities to improve curricular experiences or the learning environment, and providing aggregate feedback to students. However, the large volume of narrative assessment data created by reflective self-assessments makes it challenging for faculty to review the data and identify actionable steps to improve students’ learning experience.3 Natural language processing (NLP) is a quickly evolving field that has become an invaluable tool for analyzing large quantities of free-text entries to summarize and identify trends, offering a systematic, scalable solution.4 NLP provides an efficient way for educators to analyze feedback to improve courses. To our knowledge, however, current scholarship on applying NLP in medical education is limited.4,5 Existing studies have employed complex NLP software, which may be challenging to adopt widely across institutions with limited resources for NLP efforts.

At the University of Cincinnati College of Medicine, first- and second-year medical students complete reflective self-assessments as part of their Clinical Skills (CS) course. The narrative assessment data generated could be used to help improve CS learning activities or inform curriculum development. However, course directors generally have not analyzed data derived from these exercises due to time constraints. We sought to address this problem by using the open-source online unsupervised NLP application TopEx (http://topex.cctr.vcu.edu/; accessed 5/23/2023), developed by Virginia Commonwealth University, to perform qualitative analysis of student reflective self-assessments from the CS course and demonstrate how NLP can be employed by medical education programs.

Approach

The CS course teaches history taking and the physical exam using cases with standardized/simulated patients (SPs). Students work in groups of 3 or 4 on a weekly basis to gather a history, perform a physical exam, order and interpret diagnostic data, and communicate a plan to the SPs. All encounters are recorded within CAE LearningSpaceEnterprise, software for managing simulation encounters. Each term, students review 2 of their videos and complete 2 prompted reflective self-assessments:

  • Please comment on 1–2 aspects of your performance that surprised you.

  • Write one specific thing you could do to improve in future patient interactions.

Narrative assessment data from 2 years (academic years 2019–2020 and 2021–2022) representing 2,224 student reflective self-assessments from 344 students were extracted, imported into a .csv file, and served as the corpus for this study. Figure 1 summarizes the pipeline (the end-to-end construct that orchestrates the flow of data into, and output from, an NLP model) used to analyze these free-text entries. To prepare the data, free-text entries were cleaned using a prebuilt algorithm to remove the prompt embedded within a response as well as odd characters. We then imported the data for each prompt into TopEx for initial qualitative analysis. TopEx is open-source NLP software that clusters sentences by common themes using the following 3 steps.6,7

Figure 1.


Natural language processing analysis pipeline of student reflective self-assessments of clinical skills, showing manual and automated steps. Narrative assessment data from academic years 2019–2020 and 2021–2022 (2,224 student writings from 344 students, University of Cincinnati College of Medicine) were extracted, manually cleaned, and entered into TopEx for automated analysis. Vector representations of a word (rows) represent distribution of use across all self-assessments (columns). The most informative 6-word phrase per sentence was identified using the TF-IDF matrix, and all vectors in a phrase were averaged into a single numerical representation. These numerical sentence representations were used to cluster similar sentences, and each cluster was analyzed to identify its top 10 most important words, which define that cluster’s topic. See Supplemental Digital Appendix 5, at [LWW INSERT LINK], for representation of the full analysis pipeline. See Olex et al6,7 for additional detail. Abbreviation: TF-IDF: term frequency-inverse document frequency.

  • Sentence normalization removed lexical variation so downstream steps are less affected by differences in students’ writing styles. Uninformative words, referred to as stopwords, such as “the,” “and,” or “that” were removed. Supplemental Digital Appendices 1 and 2, at [LWW INSERT LINK], contain detailed methods and the custom stopword lists we used.

  • Sentence representation converted each normalized sentence to a numerical vector using a term frequency-inverse document frequency8 matrix, encoding the sentence context so that mathematically similar vectors represent sentences with similar content. Supplemental Digital Appendices 1 and 3, at [LWW INSERT LINK], contain detailed methods with a worked example.

  • Sentence analysis performed both unsupervised sentence clustering9 and topic analysis on each cluster. Grouping mathematically similar sentences created clusters of sentences that discussed a similar topic. A topic analysis was then run on each cluster using latent Dirichlet allocation (LDA),9 which returned a ranked list of the most informative and relevant words from the included sentences as the topic. Supplemental Digital Appendix 4, at [LWW INSERT LINK], contains the raw LDA topics output by TopEx for each cluster that we then manually reviewed, merged, and summarized (L.T., M.K., D.E.W., S.A.S., P.B.) to identify cohesive themes.
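As a rough illustration of the data preparation and the first two steps above, the following minimal Python sketch cleans an entry, normalizes it, and builds TF-IDF sentence vectors. It is not the TopEx implementation: the stopword list, the verbatim prompt matching, the treatment of "odd characters" as non-printable characters, and the smoothed TF-IDF formula are all assumptions made for illustration, and the sketch weights whole sentences rather than the averaged word vectors for a 6-word phrase described in Figure 1.

```python
import math
import re
from collections import Counter

# Assumption: the embedded prompt appears verbatim; real entries may vary.
PROMPT = "Write one specific thing you could do to improve in future patient interactions."
# Hypothetical stopword list; the study's custom lists are in its supplement.
STOPWORDS = {"the", "and", "that", "i", "a", "to", "my", "with", "of"}


def clean_entry(text):
    """Remove an embedded prompt and non-printable characters from one entry."""
    text = text.replace(PROMPT, " ")
    text = "".join(ch if ch.isprintable() else " " for ch in text)
    return " ".join(text.split())  # collapse runs of whitespace


def normalize(sentence):
    """Lowercase, tokenize on letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOPWORDS]


def tfidf_vectors(token_lists):
    """Build one sparse {term: weight} TF-IDF vector per sentence."""
    n = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Invented entries loosely modeled on the kinds of responses in Table 1.
entries = [
    PROMPT + " I can use fewer filler words while the patient is talking.",
    "I use filler words a lot while speaking.\x0b",
    "I will maintain eye contact with the patient.",
]
vectors = tfidf_vectors([normalize(clean_entry(e)) for e in entries])
# The two filler-word sentences should be closer to each other than to the third.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

The point of the sketch is only that sentences sharing informative vocabulary end up with similar vectors, which is what makes the later clustering step possible.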

A detailed and technical description of the TopEx algorithm, including validation experiments with an independently coded test corpus, can be found in Olex et al.7 To provide human-validated themes, we reviewed, merged, and summarized the clusters and associated topics output by TopEx. Our author group, including medical education leadership (P.B., L.T.), faculty (S.A.S.), course directors (D.E.W., M.K.), and a student (D.S.), independently analyzed the clustered TopEx results from each prompt to provide thematic analysis. Each team member individually reviewed the results and identified the main overarching theme represented by each cluster identified by TopEx. Next, the team and an NLP expert (all authors) met to review the aggregated themes identified by each team member, including cluster comments and themes identified. Through consensus, we agreed on thematic wording to summarize each cluster topic and determined if similar clusters should be combined under a single theme. We determined that there were no unifying themes for some very large clusters that were a catch-all for sentences that did not fit into the other topic clusters due to generic language or discussion of unique topics not mentioned by others. This behavior of TopEx is expected, as it was built to identify common themes across the input corpus; accordingly, we excluded these clusters from downstream analysis. This review process took less than an hour for each faculty member.
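To make concrete what reviewers receive (groups of similar sentences, each with a ranked word list), here is a deliberately simplified Python sketch. It is a toy stand-in, not the TopEx algorithm: it swaps in greedy single-pass clustering on Jaccard token overlap for the unsupervised clustering of sentence vectors, and plain word counts for the LDA topic analysis. The token sets, the 0.2 threshold, and all function names are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(token_sets, threshold=0.2):
    """Put each sentence in the first cluster whose seed sentence is similar enough."""
    clusters = []  # each cluster is a list of sentence indices; index 0 is the seed
    for i, tokens in enumerate(token_sets):
        for members in clusters:
            if jaccard(tokens, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing matched: start a new cluster
    return clusters


def top_words(token_sets, members, k=10):
    """Return up to k of the most frequent words in a cluster (a stand-in for LDA)."""
    counts = Counter(t for i in members for t in token_sets[i])
    return [word for word, _ in counts.most_common(k)]


# Hypothetical, already-normalized sentences.
token_sets = [
    {"use", "filler", "words", "speaking"},
    {"less", "filler", "words", "listen"},
    {"maintain", "eye", "contact", "patient"},
    {"eye", "contact", "answering", "questions"},
    {"exciting", "group", "plan"},
]
clusters = greedy_cluster(token_sets)
for members in clusters:
    print(members, top_words(token_sets, members, k=3))
```

In this toy run, the sentence that shares no vocabulary with the others ends up in a group of its own. That is only loosely analogous to the catch-all clusters described above, and in either case a human reviewer still has to decide whether a group carries a cohesive theme.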

Outcomes

Using TopEx, we identified meaningful clusters based on student reflections (Table 1) out of a total of 3,283 sentences (prompt 1, n = 1,817; prompt 2, n = 1,466). Prompt 1 had 15 clusters: 3 did not have a clear theme and 2 were merged under a single theme due to the similarity of focus. For Prompt 2, there were 13 clusters: 3 without a clear theme and 2 that were merged. In total, we removed 899 sentences.

Table 1.

Thematic Analysis of 2,224 Reflective Self-Assessments From 344 First- and Second-Year Medical Students in a Clinical Skills Course Using TopEx Natural Language Processing Analysis, University of Cincinnati College of Medicine, 2019–2020 and 2021–2022

Theme | Frequency | Example
Prompt 1: Please comment on 1–2 aspects of your performance that surprised you
 Organization and approach to asking questions | 215 | I was surprised that I did a good job asking clarifying questions if I did not fully understand what the patient was telling me.
 Analysis of history taking (sexual history, social history, etc.) | 125 | However, I felt this particular social history went well and I remember feeling honored that the patient was so open and honest with us when answering these questions.
 Guidance of patient and overall execution of the physical exam | 100 | I was surprised to see how thorough I was with making sure our patient was covered during our physical exam.
 Confidence in explaining exam maneuvers | 97 | When I explained the physical exam maneuvers, I felt like I gave an appropriate explanation without complicated terminology.
 Positive reflections on abilities | 87 | It was exciting formulating an action plan with the other students and being able to tackle the explanation as a group.
 Appearance/perception of body language | 83 | I did not realize taking thorough notes made me look somewhat uninterested.
 Talking/speech style and nonverbal behaviors | 76 | One thing I noticed is that I use my hands a lot when I talk.
 Eye contact | 46 | I maintain eye contact with the patient throughout the encounter, which I want to continue doing.
 Filler words | 41 | Additionally, I use “um” a lot while speaking, and I often say “okay” after the patient has said something.
 Time spent looking at notes and/or screens | 23 | I did not realize how much time I spend looking down toward the paper while the patient is speaking to me.
Prompt 2: Write one specific thing you could do to improve in future patient interactions
 Taking sensitive histories | 171 | I can practice phrasing social history questions to make the encounter less uncomfortable for the patient.
 Improving the efficiencies and maneuvers of the physical exam and explaining the physical exam to the patient | 114 | In future patient interactions, I could have a more methodical approach to the physical exam instead of changing the order each time.
 Changing the speed of speech | 89 | I think I can slow down my speed of talking when interacting with the patient and focus more on listening.
 Asking more open-ended questions | 87 | Try to leave questions more open-ended and not anticipate a patient’s answers.
 Using filler words such as “um,” “okay,” etc. | 57 | I can use less filler words and just listen more while the patient is talking.
 Listening to standardized patient | 47 | This means actively listening to the patient and not fidgeting with my badge or other things.
 Making eye contact | 38 | I will maintain eye contact with the patient when they are answering my questions.
 Explaining things to the patient during the physical exam and avoiding jargon | 38 | If I do make the mistake of using medical jargon, fully explain what the terms mean in nonmedical terminology so that the patient can understand.

We identified 10 overarching themes for prompt 1 (aspects that surprised the student). Several focused on how students took a history, including how they organized their questions and approached difficult portions of the history (sexual and social history). Students also commented on aspects of communication and performance of the physical examination maneuvers. These comments covered both behaviors done well and those needing improvement.

There were 8 overarching themes identified for prompt 2 (improvements), including taking sensitive histories; efficiency of maneuvers to complete while performing the physical exam; physical aspects of the examination; specifics of communication such as speed of speaking, maintaining eye contact, or asking open-ended questions; and nonverbal communication.

Resources and time often limit the ability to review and identify trends in large volumes of narrative assessment data. Organizing and analyzing the important insights these data yield can be a daunting task. Using NLP breaks down barriers to time-consuming qualitative analyses and offers one solution to detecting useful trends in narrative assessment data for improving medical education.

Applying NLP to free-text entries from students’ reflective self-assessments improved our ability to quickly assess themes regarding their clinical skills. Reflective self-assessments are valuable to guide an individual student, yet many educators are uncertain how to use these data. Viewing narrative assessment data in aggregate and using NLP has provided unique insights to help us design curricula, guide feedback, and inform students’ observations in the future. TopEx can be used as an analytic tool for narrative assessment data, and other data from free-text entries, as it provides a user-friendly and time-efficient analysis of a large corpus of data without the need to be intimately familiar with NLP and its implementation. Manual analysis of this volume of narrative assessment data requires a substantial time commitment to attempt similar identification of themes. However, employing TopEx to analyze these data significantly reduced the time commitment, making review of these assessments feasible. In so doing, we identified 4 possibilities for how the trends in narrative assessment data can be used.

Student feedback

The organization and summary analytics of TopEx output make it possible to return aggregated data to students, who often struggle to determine if their clinical performance is on target.3 Seeing the trends and collective insights from aggregate clinical skills reflective self-assessments might allow students to view their performance differently and discover previously unconsidered areas for reflection. For example, knowing that it is common for students to fail to ask open-ended questions provides meaningful feedback for improvement. Additionally, the shared understanding of common challenges may combat some of the imposter syndrome that many trainees struggle to overcome. Likewise, social learning theory10 would suggest that seeing classmates’ insights could also encourage individual learners to seek further improvement or reflect on future performance, especially in the clinical years.

Assessment revisions

Thematic analysis may help faculty to revise the assessments they deploy to target common areas needing improvement. Our findings highlight several key areas of students’ reflective self-assessments meriting further exploration. For instance, we could have anticipated that students would comment on their history-gathering but not on which specific aspects required more significant attention. Students commented on their organization of questions, ability to address sensitive topics, and use of open-ended questions. A deeper appreciation for students’ reflective self-assessments might help course directors be more precise in the assessments they create to better capture students’ developmental trajectory. In addition, future reflective self-assessment prompts may also benefit from revisions to promote deeper insights.

Curricular improvement

Using TopEx to analyze data from free-text entries allows educators to explore topics for curricular improvement by addressing those specific challenges students identify. For instance, if a significant proportion of students are surprised by their body language, there is likely potential for curricular content to address this topic. The benefits of this approach are amplified because it impacts both the students who have identified these areas needing improvement and those who may have underrecognized the same growth opportunities.

Teaching tools

TopEx organizes data from free-text entries for accessible analysis. Within each cluster of text exist countless examples from individual students’ reflective self-assessments. We have found that organizing the original text this way permits a deeper dive into specific examples. For instance, we know that students often struggle with using filler words; seeing individual responses has allowed us to become more sensitized to observing their use and thus more effective in teaching SPs and volunteer faculty to recognize and give targeted feedback to students.

Unfamiliarity with NLP and the general lack of user-friendly NLP applications raise barriers to its adoption in academic medicine. While the use of TopEx helps to overcome these barriers, NLP tools, like all tools, have limitations based on what they are designed to do and should be used with appropriate preparation and knowledge. For instance, when word clusters do not have a cohesive theme, it does not necessarily mean they contain no important observations. TopEx is meant to assist in the quick exploration of a corpus to identify main overarching themes and is not designed to fully replace human analysis or identify every topic discussed in that corpus. A human should manually review the results of TopEx, or any other NLP tool, to ensure applicability to the task at hand. Thus, it is important to understand what a tool is designed to do, even at a high level, so that results are interpreted and used within those bounds.

Next Steps

Employing TopEx has shown promise in enhancing narrative assessment analysis at our institution. We plan to further leverage TopEx by exploring its additional features and potentially integrating it with other analytical tools to gain deeper insights into students’ reflective narratives.

TopEx’s user-friendly and time-efficient nature makes it a viable option for other institutions aiming to analyze narrative assessment data. Further, the adaptability of TopEx could be explored to meet specific analytical needs of different medical education settings or to analyze other qualitative data forms. Through iterative feedback and improvements, TopEx could significantly contribute to advancing medical education.


Funding and Support:

Work performed by A.L. Olex was supported by Clinical and Translational Science Award no. UL1TR002649 from the National Center for Advancing Translational Sciences. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Center for Advancing Translational Sciences or the National Institutes of Health.

Other disclosures:

The University of Cincinnati College of Medicine receives funding for S.A. Santen as a consultant (for her work outside of this study) from the American Medical Association and also funding for M. Kelleher as a consultant from the National Board of Medical Examiners.

Footnotes

Ethical approval: The data collection and analysis of this program evaluation were deemed not human subjects research by the Institutional Review Board (1/19/2023, MOD01_2021–1032).

Previous presentations: Part of this work was presented as a poster at the Association of American Medical Colleges annual Learn Serve Lead Conference, Nashville, Tennessee, November 11–15, 2022.

Data: All data were obtained internally.

Supplemental digital content for this article is available at [LWW INSERT LINK].

Contributor Information

Laurah Turner, Department of Medical Education, University of Cincinnati College of Medicine, Cincinnati, Ohio.

Danielle E. Weber, Departments of Pediatrics and Internal Medicine, Cincinnati Children’s Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, Ohio.

Sally A. Santen, Virginia Commonwealth University School of Medicine, Richmond, Virginia, and professor of emergency medicine and medical education, University of Cincinnati College of Medicine, Cincinnati, Ohio.

Amy L. Olex, Virginia Commonwealth University, Wright Center for Clinical and Translational Research, Richmond, Virginia.

Pamela Baker, University of Cincinnati College of Medicine, Cincinnati, Ohio.

Seth Overla, Office of Medical Education, University of Cincinnati, Cincinnati, Ohio.

David Shu, University of Cincinnati College of Medicine, Cincinnati, Ohio.

Matt Kelleher, Department of Pediatrics and Internal Medicine, Cincinnati Children’s Hospital Medical Center and University of Cincinnati College of Medicine, Cincinnati, Ohio.

References

  • 1. Cutrer WB, Miller B, Pusic MV, et al. Fostering the development of master adaptive learners: A conceptual model to guide skill acquisition in medical education. Acad Med. 2017;92:70–75.
  • 2. Sandars J. The use of reflection in medical education: AMEE guide no. 44. Med Teach. 2009;31:685–695.
  • 3. Koole S, Dornan T, Aper L, et al. Factors confounding the assessment of reflection: A critical review. BMC Med Educ. 2011;11:1–9.
  • 4. Chary M, Parikh S, Manini AF, Boyer EW, Radeos M. A review of natural language processing in medical education. West J Emerg Med. 2019;20:78.
  • 5. Sjoding MW, Liu VX. Can you read me now? Unlocking narrative data with natural language processing. Ann Am Thorac Soc. 2016;13:1443–1445.
  • 6. Olex AL, French E, Burdette P, et al. TopEx: Topic exploration of COVID-19 corpora: Results from the BioCreative VII challenge track 4. Database (Oxford). 2022;2022:baac063. doi: 10.1093/database/baac063.
  • 7. Olex AL, DiazGranados D, McInnes BT, Goldberg S. Local topic mining for reflective medical writing. AMIA Summits on Translational Science Proceedings. 2020;2020:459.
  • 8. Roelleke T, Wang J. TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 435–442. Association for Computing Machinery: 2008. https://dl.acm.org/doi/abs/10.1145/1390334.1390409. Accessed October 2, 2023.
  • 9. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Machine Learn Res. 2003;3:993–1022.
  • 10. Bandura A, Walters RH. Social learning theory. Vol 1. Englewood Cliffs, NJ: Prentice Hall; 1977.
