Journal of General Internal Medicine
2024 Oct 14;40(1):127–134. doi: 10.1007/s11606-024-09050-9

Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

Radhika Sreedhar 1, Linda Chang 1, Ananya Gangopadhyaya 1, Peggy Woziwodzki Shiels 1, Julie Loza 1, Euna Chi 1, Elizabeth Gabel 1, Yoon Soo Park 1
PMCID: PMC11780228  PMID: 39402411

Abstract

Background

The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments. However, providing individualized feedback on these assignments requires significant faculty time. As large language models (LLMs) can score and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses.

Objective

To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.

Design and Participants

This was a cross-sectional study of pre-clinical students’ critical appraisal assignments at University of Illinois College of Medicine (UICOM) during the 2022–2023 academic year.

Intervention

An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.

Main Measures

Differences in scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency using inter-rater reliability (IRR) was calculated as percent exact agreement. Chi-squared test was used to determine if there were significant differences in scores. Psychometric characteristics including internal-consistency reliability, area under precision-recall curve (AUCPR), and cost were studied.

Key Results

In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with ChatGPT's grading, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61–0.76). Internal-consistency reliability of ChatGPT was 0.64, and its use resulted in a fivefold reduction in faculty time, with potential savings of 150 faculty hours.

Conclusions

This study of psychometric characteristics of ChatGPT demonstrates the potential role for LLMs to assist faculty in assessing and providing feedback for formative assignments.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11606-024-09050-9.

KEY WORDS: ChatGPT; scoring consistency; formative assignments

INTRODUCTION

Background

The Liaison Committee on Medical Education (LCME) requires faculty to provide individualized feedback on students’ self-directed learning (SDL) skills.1,2 Currently, undergraduate medical education (UME) relies on selected responses (multiple choice questions) or checklists, with lesser use of constructed-response assessments.3 Constructed responses are those that require a student to produce an answer ranging from a couple of sentences to several paragraphs.4 Evaluation of constructed responses requires faculty development, time, and calibration.5 Studies have demonstrated the utility of natural language processing (NLP) and machine learning (ML) in the assessment of constructed responses.4 Evaluating individual constructed responses to critical appraisals of scientific literature chosen by the students is complex, and we explore the use of artificial intelligence for this assessment.

Role of Artificial Intelligence (AI) in UME Assessment

A systematic review of the use of AI for assessment highlights its role in grading student responses and formative assessments.6,7 Anderson et al. validated an electronic scoring system of students’ clinical notes to provide reports of students’ experiences.8 Mirchi et al. created a Virtual Operative Assistant to classify students based on performance benchmarks.9 Saplacan et al., in their study, suggested that digital feedback should be designed to arouse positive emotions.10 These studies were done before the emergence of LLMs.

Emergence of Large Language Models and Generative AI

LLMs are mathematical models of the statistical distribution of words in publicly available human-generated text.11 For example, if we give the prompt “the antibiotic used to treat sinusitis is” and receive the response “amoxicillin,” it is important to recognize that what we are actually asking is “based on the statistical distribution of words in publicly available text, what words are most likely to follow the sequence ‘the antibiotic used to treat sinusitis is’?” A good reply to this question is “amoxicillin.”6

LLMs can use a dialogue-based format to analyze written responses,12 but unlike NLP and ML approaches, they do not require significant effort or expertise to achieve desired outcomes.13,14 Given the ease of use of LLMs, this report focuses on the potential use of one LLM, ChatGPT, to score constructed-response critical appraisal assignments.

Chatbot Generative Pre-trained Transformer (ChatGPT)

OpenAI released GPT-3, a language processing AI model, and ChatGPT, a computer application using ML that simulates conversations by responding to keywords or sentences.15,16 Generative AI refers to a neural network (NN) ML model that generates text and other content using the data it was trained on.17 Pre-training means that the language model is first trained on a large dataset before being fine-tuned on a specific task.18 Deep learning (DL) is a subset of ML that uses NNs to imitate human intelligence by automatically learning and extracting data features.19 A transformer is a type of DL model that processes sequential data, such as language, and weighs the significance of elements in the input and output sequences, enabling it to generate coherent text.20

ChatGPT is increasingly being used by students for learning and to answer clinical questions.21 The NLP capabilities of ChatGPT enable it to automate time-intensive tasks, like summarizing and evaluating written text.22 However, there are few studies specifically exploring the use of ChatGPT to score and provide feedback for critical appraisal assignments.23,24 Rubric-based assessments by ChatGPT can indicate if the student meets the bare minimum, allowing faculty to spend more time providing feedback for students’ critical appraisal skills. We explore this as a potential solution to the challenges of scoring and providing feedback for formative constructed response critical appraisal assignments.

Objectives

We compare the accuracy, consistency, and feasibility of using ChatGPT versus faculty to score and provide feedback for formative critical appraisal assignments. Findings from this study will help determine whether ChatGPT can be used in this manner, adding to the validity and reliability arguments for using LLMs to facilitate formative assessment.25

METHODS

Design, Setting, Participants

The institutional review board of the University of Illinois Chicago approved this study under the exempt research determination. The study follows the STROBE reporting guideline for observational studies. This was a multisite cross-sectional study of the assessment of critical appraisal assignments completed by preclinical medical students at the Chicago, Peoria, and Rockford campuses of UICOM in the 2022–2023 academic year.

Data Sources

Self-directed Learning Critical Appraisal Assignment

As part of this assignment (Supplement 1), students applied concepts learned in their Evidence-Based Medicine (EBM) coursework. They were asked to develop a foreground question (a specific question on a diagnostic or therapeutic patient problem) based on a standardized patient encounter.26 Students answered this question by acquiring a primary source from the literature and providing short constructed responses to questions regarding its strengths and weaknesses. They indicated how the results of the article they chose applied to their patient.

The assignments used had already been scored by trained faculty using the scoring rubric shown in Table 1. The faculty scoring indicated whether students met expectations for four aspects of EPA 7 and included written feedback. The rubric comprised five analytic checklist items (Table 1). In an internal validation study, pairwise agreement between raters using this rubric had a weighted kappa ranging between 0.42 and 0.56, indicating moderate agreement.

Table 1.

Self-directed Learning Critical Appraisal Assignment Grading Rubric

EPA 7 component | Does not meet expectations | Meets expectations
Self-identified learning need (SDL), Question 1 | Learning need related to the patient is not clearly described | Learning need related to the patient is clearly described
PICO (ASK), Questions 2 and 3 | Clinical question is missing PICO components without explanation | Clinical question has all the PICO components
Resource (ACQUIRE), Question 4 | Resource is not a primary source | Resource is a primary source
Strengths and weaknesses (APPRAISE), Question 5 | Student does not describe the strengths and weaknesses of the article | Student describes the strengths and weaknesses of the article
Answer (APPLY), Question 6 | Student’s summary of how the study addresses the patient’s problem is not explained | Student’s summary of how the study addresses the patient’s problem is explained

Preclinical students complete six such formative assignments longitudinally, with individual faculty feedback. These assignments fulfilled element 6.3 of the LCME accreditation standards on SDL1 and were designed to ensure spaced repetition of the ask, acquire, appraise, and apply principles identified by Entrustable Professional Activity (EPA) 7 (see eTable 1).27

Selection of Assignments

To optimize the exploration of scoring consistency and feasibility between ChatGPT and faculty assessors, we strategically sampled assessments that reflected variability across scoring categories (rather than the natural distribution of assessments). Study authors were asked to identify 10–15 assignments in which students did not meet at least one criterion on the grading rubric (see Table 1). We chose a sample size of 111 assignments to reflect the scoring distribution across categories. Using distribution comparisons between the subsample and the total distribution, we found that this sample size adequately reflected assessments to estimate measures of consistency in our psychometric analysis.28
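For illustration only, a subsample-versus-total distribution check of this kind could be run as below. The category counts are hypothetical placeholders, since the paper does not report the category-level distribution or the exact comparison procedure used.

```python
# Hypothetical sketch: chi-squared goodness-of-fit test comparing score-category
# proportions in the 111-assignment subsample against the full pool of graded
# assignments. Counts below are made up for illustration.
from scipy.stats import chisquare

total_counts = [420, 130, 50]      # full pool of 600 assignments, by scoring category (illustrative)
subsample_counts = [72, 27, 12]    # the 111 sampled assignments, by the same categories (illustrative)

total_n = sum(total_counts)
sub_n = sum(subsample_counts)
expected = [c / total_n * sub_n for c in total_counts]

stat, p = chisquare(subsample_counts, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}  (p > 0.05 suggests the subsample mirrors the pool)")
```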

Training and Scoring Written Assessment Using ChatGPT

Prompt Development

An initial sample set of 10 student assignments (separate from the selected 111 assignments) was used to develop the prompt. The prompt mirrored faculty instructions for grading the assignment. The following candidate prompts were tested during development:

  1. Please grade the write-up below

  2. Grade the assignment below based on AAMC’s EPA 7 and indicate the standards you used for grading.

  3. Please grade this student’s write-up

  4. Grade the student’s write-up below using a letter grade.

  5. Please grade the assignment below using the following Assessment Categories.

The prompt shown in Supplement 3 was chosen as it consistently elicited appropriate responses from ChatGPT 3.5 for the checklist items.29

Data Submission to ChatGPT and Scoring

Each selected assignment was provided a unique identifier. For each student entry, the de-identified assignment, scoring rubric, and prompt were provided to ChatGPT 3.5 in the same chat thread in April 2023. The scores on individual checklist items for each assignment provided by ChatGPT and faculty were entered in an Excel spreadsheet.
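The study submitted each assignment through the ChatGPT 3.5 web interface. As an illustration only, and not the authors' workflow, the same submission step could be scripted against the OpenAI API roughly as follows; the model name, message layout, and helper function are assumptions for the sketch.

```python
# Sketch only: programmatic equivalent of pasting the de-identified assignment,
# scoring rubric, and prompt into ChatGPT 3.5 (assumed setup, not the study's).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_assignment(prompt: str, rubric: str, assignment_text: str) -> str:
    """Return ChatGPT's rubric-based scoring and feedback for one assignment."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},  # grading instructions
            {"role": "user",
             "content": f"Scoring rubric:\n{rubric}\n\nAssignment:\n{assignment_text}"},
        ],
        temperature=0,  # reduce run-to-run scoring variability
    )
    return response.choices[0].message.content
```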

We compared the scores generated by ChatGPT with those of faculty. The outcomes assessed included (1) descriptive statistics (% responses at the item level and across items); (2) psychometric characteristics (internal-consistency reliability; inter-rater reliability); and (3) cost. Feedback for written responses was measured using analytic (five items) and holistic (global) measures as shown in Table 1. We asked ChatGPT to provide a letter grade and a justification for it to capture the essence of feedback, but the letter grade was not analyzed because faculty did not assign grades and ChatGPT was given no specific instructions on how to assign one. See Figure 1 for the flow diagram.

Figure 1. Study flow.

Statistical Methods

We examined scoring consistency using inter-rater reliability (IRR), calculated as percent exact agreement and kappa for each item (dichotomously scored as 0 = not present, 1 = present) and for a summary score across all items. Exact agreement takes the crude proportion of agreement between raters, while kappa corrects for chance agreement. A chi-squared test was used to determine whether there was a significant difference in scores between faculty and ChatGPT; logistic regression was used to examine the odds of ChatGPT predicting the faculty assessor rating. To assess ChatGPT’s ability to distinguish performance in terms of precision (not marking negative performance as positive) and recall (identifying positive performance), we estimated the area under the precision-recall curve (AUCPR).
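As a minimal sketch of these consistency analyses (the study itself used Stata), the agreement, kappa, chi-squared, odds-ratio, and AUCPR calculations could be reproduced in Python as below, using illustrative 0/1 item scores rather than study data.

```python
# Illustrative consistency analysis for one checklist item (simulated data).
import numpy as np
from sklearn.metrics import cohen_kappa_score, average_precision_score
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
faculty = rng.integers(0, 2, size=111)                           # 0 = not present, 1 = present
chatgpt = np.where(rng.random(111) < 0.7, faculty, 1 - faculty)  # roughly 70% agreement

# Percent exact agreement and chance-corrected kappa
exact_agreement = np.mean(chatgpt == faculty)
kappa = cohen_kappa_score(faculty, chatgpt)

# Chi-squared test for a difference in the proportion scored as "present"
counts = np.array([[chatgpt.sum(), len(chatgpt) - chatgpt.sum()],
                   [faculty.sum(), len(faculty) - faculty.sum()]])
chi2, p, _, _ = chi2_contingency(counts)

# Odds of ChatGPT matching the faculty rating, from the 2x2 cross-tabulation
# (for a single dichotomous predictor this equals the logistic-regression OR)
a = np.sum((chatgpt == 1) & (faculty == 1))
b = np.sum((chatgpt == 1) & (faculty == 0))
c = np.sum((chatgpt == 0) & (faculty == 1))
d = np.sum((chatgpt == 0) & (faculty == 0))
odds_ratio = (a * d) / (b * c)

# Area under the precision-recall curve, treating faculty scores as the reference
aucpr = average_precision_score(faculty, chatgpt)

print(f"agreement={exact_agreement:.2f}, kappa={kappa:.2f}, chi2 p={p:.2f}, "
      f"OR={odds_ratio:.2f}, AUCPR={aucpr:.2f}")
```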

An alpha level of 0.05 was used to determine statistical significance, and internal-consistency reliability was determined using Cronbach’s alpha. Item-level data were dichotomous and reported as number and percent. The conversion factor for the cost-feasibility analysis was based on work by Yudkowsky et al.,30 assuming faculty grading costs of $150/h for faculty time and about $29/h for graduate teaching assistant (TA) time. Faculty time was the self-reported time it took faculty to grade an assignment. Data compilation and analyses were conducted using Stata 18 MP (StataCorp, College Station, TX).31
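A sketch of the Cronbach's alpha calculation across the five rubric items is shown below, assuming item scores are arranged as a students-by-items matrix of 0/1 values; the matrix here is a small placeholder, not study data.

```python
# Cronbach's alpha across rubric items (placeholder data).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = students, columns = rubric items (here dichotomous 0/1)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Tiny placeholder matrix: 6 students x 5 checklist items (real input would be 111 x 5)
items = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
], dtype=float)
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
```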

RESULTS

Participants

Data for this study used SDL critical appraisal assignments from 111 students across three sites (Chicago, Peoria, and Rockford). This represents 18.5% of the 600 assignments completed by April 2023, which were graded by six different faculty graders.

Outcome Data

Overall, the scoring of individual items by ChatGPT and faculty raters was comparable, as shown in Table 2. There were no significant differences between the proportion of responses scored by ChatGPT and faculty assessors (all P > 0.05). Aggregating across items, ChatGPT and faculty assessors scored 87% and 70% of items as meeting criteria, respectively. Mean AUCPR was 0.69 (range between items 0.61–0.76). The overall agreement was 67%; ChatGPT results aligned with faculty assessors (OR = 2.53, P < 0.001).

Table 2.

Descriptive Statistics and Percent Exact Agreement Between ChatGPT and Faculty Scoring (No. = 111)

Checklist item | ChatGPT scoring: No. (%) of students who met criteria | Faculty scoring: No. (%) of students who met criteria | % exact agreement between ChatGPT and faculty scores | P-value | Item discrimination
Is the learning need related to the patient clearly described? | 99 (89%) | 88 (79%) | 74% | 0.70 | 0.38
Does the clinical question have all the PICO components? | 106 (95%) | 93 (84%) | 81% | 0.81 | 0.50
Is the citation that was used primary literature? | 104 (94%) | 63 (57%) | 60% | 0.12 | 0.25
Does the student describe the strengths and weaknesses of the article cited? | 88 (79%) | 81 (73%) | 63% | 0.91 | 0.35
Does the student summary explain how the study cited addresses the patient’s problem? | 86 (77%) | 66 (59%) | 59% | 0.39 | 0.54

P-values are based on chi-squared tests and are non-significant for all items. Overall exact agreement between ChatGPT and faculty is 67% (OR = 2.53, P < 0.001). Internal-consistency reliability (Cronbach’s alpha) is 0.64.

At the item level, agreement between ChatGPT and faculty raters ranged between 59 and 81%. The overall weighted kappa (the proportion of agreement between ChatGPT and faculty corrected for chance agreement) was 0.17. The mean inter-item correlations for ChatGPT and faculty were 0.27 and 0.13, respectively. Discrimination between faculty and ChatGPT scores was highest for items assessing the “ASK” and “APPLY” components of EPA 7. All items had a discrimination above 0.20. Internal-consistency reliability (Cronbach’s alpha) of ChatGPT was 0.64, demonstrating acceptable reproducibility for formative assessments.

Most students met the “ASK” criteria for EPA 7 on both ChatGPT and faculty grading. The agreement between ChatGPT and faculty was lowest for the EPA 7 component “APPLY.” ChatGPT did not properly identify when study content did not apply to the patient, reflecting findings in the literature that its responses may lack context.32 The second most common area of disagreement in ratings was in determining whether the citation used was primary literature.

The feedback provided by ChatGPT was not as specific as the faculty’s but addressed most areas of the assignment. It was also more positive and effusive than the faculty feedback. Examples of feedback in which there was a discrepancy are shown in Table 3.

Table 3.

Selected Discordant Feedback Provided by ChatGPT and Faculty for Preclinical Medical Student’s Critical Appraisal Assignments

ChatGPT feedback: Good job on the critical appraisal of the article. You clearly identified the learning needs related to the patient, formulated a well-structured clinical question, and provided a succinct summary of the study. You also discussed the strengths and weaknesses of the source, and how the information you found is applicable to your patient’s problem.
Faculty feedback: Overall, you did a nice job. I would encourage you to continue working on focusing your PICO question to be as specific as possible, so that you can learn to evaluate specific interventions and make clinical decisions for your patient based on what you find. The study you selected here seems applicable; I would have liked a few more details about the specific intervention(s) they used to help decide if it's something you could offer to your patient.

ChatGPT feedback: Overall, the student did a good job of critically appraising the article and summarizing the information found in a way that could be presented to a preceptor. The student's decision to use this therapy for their patient is justified by the information presented in the article.
Faculty feedback: This study, as described in the abstract, is a review. The abstract does not present any data at all, or so I can’t tell if it is really a review or is expert opinion. In any event, you cannot tell your patient why you believe changing her nutrition will answer any question she has. For that reason, it does not serve the purposes of this exercise. I agree that evidence of changing nutrition would help skin or joint problems with psoriasis would be useful for this patient, but I do not see such evidence. Remember, this exercise is to learn to find primary evidence that would help in your care of a specific patient: such evidence has new data.

ChatGPT feedback: The student did a good job of identifying a strength and weakness of the article cited. However, more detail could have been provided in the appraisal of the study's methodology and validity. G. B+. The student effectively answered the questions and provided a relevant primary resource. However, there is room for improvement in the critical appraisal of the article cited.
Faculty feedback: Remember that reviews are not primary sources, although they are very helpful. You could have picked one of the RCTs of methotrexate vs DMARD and that would have been perfect. Also, I'm not seeing where your conclusions are compared to the standard of care. Please include a sentence/addend the last sentence to address how the article helps answer your PICO question.

ChatGPT feedback: I would give the student an A grade for their critical appraisal of the article cited. The student provides a clear and concise summary of their learning need, constructs a well-formed PICO question, and finds a primary source that addresses the patient's problem. The student also appraises the article effectively, highlighting strengths and weaknesses and providing justification for their decision not to use the therapy for their patient.
Faculty feedback: In reviewing your assignment, your question is not related to your patient case, the article you chose was a review. Additionally, your strengths and weaknesses are statements but do not talk about how the authors made these statements. Please review and return it to me.

ChatGPT feedback: The student did an excellent job of formulating a clear and specific PICO question related to their patient encounter. The chosen citation is a primary source and addresses the question well, providing a meta-analysis of methotrexate use in psoriasis. The student's appraisal of the article is also well done, discussing strengths and weaknesses not highlighted by the authors. However, the student could have expanded more on how the study addresses the patient's problem, specifically addressing the question of whether methotrexate is better than biologic therapy for symptom alleviation.
Faculty feedback: Meta-analyses are not primary sources unfortunately. More details in the explanation are needed. When explaining the answers, make sure to state the results from the study and the facts from the patient situation to tie those in together.

Based on faculty feedback, it takes trained faculty raters 5–10 min (median 7 min) to complete the checklist and provide narrative feedback. This accounts for the time faculty spend scoring the assignment and crafting the feedback. ChatGPT takes less than a minute to provide a response, and an additional minute was added to input the data. Based on these estimates, the use of ChatGPT led to a fivefold reduction in faculty scoring time, from an estimated average of 7 to 2 min/assignment, a saving of 5 min/assignment. There are a total of 1800 such assignments for a cohort of preclinical students at UICOM. Using these parameters, we estimate a potential cost savings of $29,760, assuming faculty grading costs of $150/h for faculty time and about $29/h for graduate teaching assistant (TA) time, as shown in Table 4.

Table 4.

Potential Annual Cost Savings with Use of a Teaching Assistant (TA) with ChatGPT to Help Faculty Grade Assignments

 | Grading time per assignment | Cost in hours per cohort of students | Cost in dollars per cohort of students
Faculty alone | 7 min | 210 h (7 min × 1800 assessments) | $31,500 (210 h × $150/h)
TA submits student assignments and checklist to obtain ChatGPT grades | 2 min | 60 h (2 min × 1800 assessments) | $1,740 (60 h × $29/h)
Cost savings to implement TA and ChatGPT | 5 min | 150 h | $29,760

This is based on each cohort having 300 students and 6 assignments, accounting for 1800 total assignments per cohort
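The arithmetic behind Table 4 can be reproduced directly from the reported parameters (7 min and 2 min per assignment, 1800 assignments, $150/h faculty and $29/h TA rates); the short calculation below simply checks those figures and introduces no new data.

```python
# Worked check of the Table 4 cost estimate, using the rates and times reported in the paper.
ASSIGNMENTS = 1800   # 300 students x 6 assignments per cohort
FACULTY_RATE = 150   # $/h
TA_RATE = 29         # $/h

faculty_hours = ASSIGNMENTS * 7 / 60         # 7 min/assignment -> 210 h
ta_hours = ASSIGNMENTS * 2 / 60              # 2 min/assignment -> 60 h

faculty_cost = faculty_hours * FACULTY_RATE  # $31,500
ta_cost = ta_hours * TA_RATE                 # $1,740
savings = faculty_cost - ta_cost             # $29,760

print(f"Faculty alone: {faculty_hours:.0f} h, ${faculty_cost:,.0f}")
print(f"TA + ChatGPT:  {ta_hours:.0f} h, ${ta_cost:,.0f}")
print(f"Savings:       {faculty_hours - ta_hours:.0f} h, ${savings:,.0f}")
```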

DISCUSSION

In this cross-sectional study of formative critical appraisal assignments, the scoring of individual items by ChatGPT and faculty was comparable, with an overall agreement of 67% and potential savings of faculty time.

“A person working in partnership with an information resource is ‘better’ than that same person unassisted.”33 The overall agreement and the time savings enabled by using ChatGPT suggest that this partnership can be used to provide more opportunities for feedback on this formative assignment.

The strengths of constructed-response assignments include authenticity, the educational effect of deeper learning, and the catalytic effect of feedback. However, their drawback is feasibility.34 In the last few years, there have been substantial advances in the application of NLP techniques to assess student-constructed responses, but few studies have harnessed the power of LLMs in this area.35 This study is unique in its comparison of LLMs with faculty in scoring and providing feedback for formative critical appraisal assignments.

Prior studies of systems used to assess constructed responses have shown that many of these systems are as reliable as human raters.36 Studies have also shown that the criteria and accuracy of human essay graders are limited.37 Although the LLM’s performance in this study was not as strong as in prior studies, its reliability is acceptable for formative assessments.

OpenAI collects all data that is input into ChatGPT for model training. Using ChatGPT to score assignments without de-identifying them therefore has privacy implications that should be addressed before use.38 Bias in AI algorithms affecting the feedback, and student and faculty acceptance of LLMs in grading, are other factors that should be considered.

Interpretation

The advantages of using ChatGPT to score formative assignments include greater availability, rapid turnaround time, and lower cost per assignment. With an overall exact agreement of 67% and AUCPR of 0.69, ChatGPT may be appropriate for scoring formative assessments. AI raters using checklists and scoring rules can provide significant assistance to faculty raters for formative assessments just as non-clinician raters do currently.39

Generalizability

Once the appropriate prompt has been identified, the steps to input the prompt and assignment into ChatGPT and obtain scores on the grading checklist are simple. As multiple authors with written instructions were able to use ChatGPT to score assignments without difficulty, we are confident that these results can be generalized. This study was conducted at a single medical school, and results may not apply to other schools with different curricula. However, as all schools must address LCME element 6.3, the use of ChatGPT for scoring such formative assignments can be useful. Future work may require larger studies with more heterogeneous learner populations and learning contexts to determine the continued efficacy of using LLMs to enhance education through formative assessments.

Limitations

Challenges seen in the literature regarding the use of AI to grade constructed responses include errors in the interpretation of student content, misclassification, and ignoring student content.4 The use of ChatGPT in this assessment similarly demonstrated instances where student content was ignored or misinterpreted.

The creation and optimization of prompts to elicit desired responses from LLMs is called prompt engineering.40 The inability of ChatGPT to differentiate between primary and secondary literature in this study may be due to a lack of prompt specificity. Newer literature on prompts for grading may help obtain better results.41,42

A systematic review of automated essay scoring observed that essay evaluation is not based on the relevance of the content, and this limitation is also seen with LLMs.43 ChatGPT had difficulty discerning the applicability of study results to patients in our study. One of the fundamental tenets of EBM is ensuring that the evidence applies to the patient. This issue can be addressed by having human graders score some of the formative assignments.

ChatGPT scored the same assignment differently when given the same prompt at a different time. This is a limitation of any neural network–powered system and affects its external validity, but is less of a concern for formative assessments. Biases in the training datasets may be reflected in the responses ChatGPT generates and are a flaw to keep in mind.

The study used selected assignments that did not meet all criteria on the grading checklist to identify how LLMs performed in identifying major deficiencies in competency. Future studies should be designed to include the natural distribution of data and use a gold standard for comparison to reduce confounding. A larger and more diverse sample, including multiple institutions, and a range of assignment types would likely provide more robust results.

The calculations used for cost savings are based on rates found in the literature. The time savings are based on self-reported time faculty spent on manual assessments, which is prone to error.

These rates are also based on the use of ChatGPT 3.5, which is currently free to the public.44 However, this may not be the case for long, given that OpenAI has introduced versions requiring a subscription fee. Because ChatGPT had difficulty distinguishing between primary and secondary article sources, some faculty review would still be needed, which would likely reduce the cost savings. As more institutions explore creating in-house LLMs or licensing LLMs for use, a detailed cost–benefit analysis will be needed once these are established.

CONCLUSIONS

The advantages of using ChatGPT to score formative assignments include greater availability, rapid turnaround time, and lower cost per assignment. With an overall exact agreement of 67%, ChatGPT may be appropriate for scoring formative assessments. AI raters using checklists and scoring rules can provide significant assistance to faculty raters for formative assessments just as non-clinician raters do currently.39

This cross-sectional study of the psychometric characteristics of ChatGPT scoring of SDL critical appraisal assignments, as compared to faculty, demonstrates validity evidence acceptable for scoring and providing feedback on formative assessments.

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements:

We used ChatGPT 3.5 by OpenAI as a part of the formal research design to grade assignments as compared to faculty graders.

Data Availability

Data will be provided on request.

Declarations:

Conflict of Interest:

The authors declare that they do not have a conflict of interest.

Footnotes

Prior Presentations: AMEE Glasgow 2023.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Liaison Committee on Medical Education. Standards, Publications and Notification Forms. Available at https://lcme.org/publications/. Accessed on 24 Feb 2024.
  • 2.Papanagnou D, Corliss S, Richards JB, Artino AR Jr, Schwartzstein R. Progression of self-directed learning in health professions education: Clarifying terms and processes. Acad Med. 2024;99(2):236. 10.1097/ACM.0000000000005191. [DOI] [PubMed] [Google Scholar]
  • 3.Van Wijk EV, Janse RJ, Ruijter BN, et al. Use of very short answer questions compared to multiple choice questions in undergraduate medical students: An external validation study. PLoS One. 2023;18(7): e0288558. 10.1371/journal.pone.0288558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Magliano JP, Graesser AC. Computer-based assessment of student-constructed responses. Behav Res Methods. 2012;44(3):608-621. 10.3758/s13428-012-0211-3. [DOI] [PubMed] [Google Scholar]
  • 5.Hauer KE, Boscardin C, Brenner JM, van Schaik SM, Papp KK. Twelve tips for assessing medical knowledge with open-ended questions: Designing constructed response examinations in medical education. Med Teach. 2019;42(8):880-885. 10.1080/0142159x.2019.1629404. [DOI] [PubMed] [Google Scholar]
  • 6.González-Calatayud V, Prendes-Espinosa P, Roig-Vila R. Artificial intelligence for student assessment: A systematic review. Appl Sci. 2021;11(12):5467. 10.3390/app11125467. [Google Scholar]
  • 7.Chen YK, Wrenn JO, Xu H, et al. Automated Assessment of Medical Students’ Clinical Exposures according to AAMC Geriatric Competencies. PubMed. 2014; 2014:375-384. [PMC free article] [PubMed] [Google Scholar]
  • 8.Spickard A, Ridinger H, Wrenn J, et al. Automatic scoring of medical students’ clinical notes to monitor learning in the workplace. Med Teach. 2013;36(1):68-72. 10.3109/0142159x.2013.849801. [DOI] [PubMed] [Google Scholar]
  • 9.Mirchi N, Bissonnette V, Yilmaz R, Ledwos N, Winkler-Schwartz A, Del Maestro RF. The Virtual Operative Assistant: An explainable artificial intelligence tool for simulation-based training in surgery and medicine. Pławiak P, ed. PLOS ONE. 2020;15(2): e0229596. 10.1371/journal.pone.0229596. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Saplacan D, Herstad J, Pajalic Z. Feedback from digital systems used in higher education: An inquiry into triggered emotions: two universal design-oriented solutions for a better user experience. In: Transforming Our World through Design, Diversity and Education: Proceedings of Universal Design and Higher Education in Transformation Congress 2018. Vol 256. IOS Press; 2018:421-430. [PubMed]
  • 11.Shanahan M. Talking About Large Language Models. arXiv (Cornell University). Published online December 7, 2022. 10.48550/arxiv.2212.03551.
  • 12.Gardner J, O’Leary M, Yuan L. Artificial intelligence in educational assessment: “Breakthrough? Or buncombe and ballyhoo?” J Comput Assist Learn. 2021;37(5):1207-1216. 10.1111/jcal.12577. [Google Scholar]
  • 13.Nur M, Arief Ramadhan, Hendric L. Automatic essay exam scoring system: a systematic literature review. Procedia Comput Sci. 2023; 216:531-538. 10.1016/j.procs.2022.12.166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Hussein MA, Hassan H, Nassef M. Automated language essay scoring systems: a literature review. PeerJ Comput Sci. 2019;5: e208. 10.7717/peerj-cs.208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Altmäe S, Sola-Leyva A, Salumets A. Artificial intelligence in scientific writing: a friend or a foe? Reproductive Biomedicine Online. Published online April 1, 2023. 10.1016/j.rbmo.2023.04.009. [DOI] [PubMed]
  • 16.Open AI. ChatGPT: optimizing language models for dialogue. Open AI. Published November 30, 2022. https://openai.com/blog/chatgpt/. Accessed 1 Feb 2023.
  • 17.Martineau K. What is generative AI? IBM Research Blog. Published February 9, 2021. https://research.ibm.com/blog/what-is-generative-AI.
  • 18.Kasneci E, Sessler K, Küchemann S, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences. Sci Direct. 2023;103(102274). 10.1016/j.lindif.2023.102274.
  • 19.Mackenzie SC, Sainsbury CAR, Wake DJ. Diabetes and artificial intelligence beyond the closed loop: a review of the landscape, promise and challenges. Diabetologia. 2024;67(2):223-235. 10.1007/s00125-023-06038-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Sanmarchi F, Bucci AF, Nuzzolese AG, et al. A step-by-step researcher’s guide to the use of an AI-based transformer in epidemiology: an exploratory analysis of ChatGPT using the STROBE checklist for observational studies. J Public Health. Published online May 26, 2023. 10.1007/s10389-023-01936-y. [DOI] [PMC free article] [PubMed]
  • 21.Grabb D. ChatGPT in Medical Education: A Paradigm Shift or a Dangerous Tool? Acad Psychiatr. 2023;47(4):439-440. 10.1007/s40596-023-01791-9. [DOI] [PubMed] [Google Scholar]
  • 22.Lee H. The Rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ. Published online March 14, 2023. 10.1002/ase.2270.
  • 23.Mohammad B, Turjana Supti, Mahmood Alzubaidi, et al. The pros and cons of using ChatGPT in medical education: A scoping review. Published online June 29, 2023. 10.3233/shti230580. [DOI] [PubMed]
  • 24.Denny JC, Spickard A, Speltz PJ, Porier R, Rosenstiel DE, Powers JS. Using natural language processing to provide personalized learning opportunities from trainee clinical notes. J Biomed Inform. 2015; 56:292-299. 10.1016/j.jbi.2015.06.004. [DOI] [PubMed] [Google Scholar]
  • 25.Yudkowsky R, Yoon-Soo Park, Downing SM. Assessment in Health Professions Education. Routledge, New York, NY; 2020. [Google Scholar]
  • 26.Seguin A, Haynes RB, Carballo S, Iorio A, Perrier A, Agoritsas T. Translating clinical questions by physicians into searchable queries: Analytical survey study. JMIR Med Educ. 2020;6(1): e16777. 10.2196/16777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Core EPA Publications and Presentations. AAMC. Available at https://www.aamc.org/what-we-do/mission-areas/medical-education/cbme/core-epas/publications. Accessed 24 Feb 2024.
  • 28.Park YS, Hyderi A, Bordage G, Xing K, Yudkowsky R. Inter-rater reliability, and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format. Adv Health Sci Educ Theory Pract. 2016;21(4):761-73. 10.1007/s10459-015-9664-3. [DOI] [PubMed] [Google Scholar]
  • 29.Prompting AI chatbots. Available at: https://cte.ku.edu/prompting-ai-chatbots. Accessed 10 November 2023.
  • 30.Yudkowsky R, Hyderi A, Holden J, et al. Can nonclinician raters be trained to assess clinical reasoning in postencounter patient notes? Acad Med. 2019;94: S21-S27. 10.1097/acm.0000000000002904. [DOI] [PubMed] [Google Scholar]
  • 31.StataCorp. 2023. Stata Statistical Software: Release 18. College Station, TX: StataCorp LLC. [Google Scholar]
  • 32.Temsah O, Khan SA, Yazan Chaiah, et al. Overview of early ChatGPT’s presence in medical literature: Insights from a hybrid literature review by ChatGPT and Human Experts. Cureus. Published online April 8, 2023. 10.7759/cureus.37281. [DOI] [PMC free article] [PubMed]
  • 33.Friedman CP. A “fundamental theorem” of biomedical informatics. J Am Med Inform Assoc. 2009;16(2):169-170. [DOI] [PMC free article] [PubMed]
  • 34.Brenner J, Fulton TB, Marieke Kruidering, et al. What have we learned about constructed response short-answer questions from students and faculty? A multi-institutional study. Med Teach. Published online September 9, 2023:1–10. 10.1080/0142159x.2023.2249209. [DOI] [PubMed]
  • 35.McNamara DS, Crossley SA, Roscoe RD, Allen LK, Dai J. A hierarchical classification approach to automated essay scoring. Assess Writ. 2015; 23:35-59. 10.1016/j.asw.2014.09.002. [Google Scholar]
  • 36.Shermis MD, Burstein JC. Automated Essay Scoring. Routledge; 2003, 71–86. [Google Scholar]
  • 37.McNamara DS, Crossley SA, McCarthy PM. The linguistic features of quality writing. Writ Commun. 2010a;27:57–86 [Google Scholar]
  • 38.Sallam M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11(6):887. 10.3390/healthcare11060887. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Yudkowsky R, Hyderi A, Holden J, et al. Can nonclinician raters be trained to assess clinical reasoning in postencounter patient notes? Acad Med. 2019;94: S21-S27. 10.1097/acm.0000000000002904. [DOI] [PubMed] [Google Scholar]
  • 40.Mearian L. How to train your chatbot through prompt engineering. Computerworld. Published March 21, 2023. https://www.computerworld.com/article/3691253/how-to-train-your-chatbot-through-prompt-engineering.html. Accessed 24 Feb 2024
  • 41.How ChatGPT Can Help with Grading. Available at https://blog.tcea.org/chatgpt-grading/. Accessed 24 Feb 2024
  • 42.Atlas S. Chatbot Prompting: A guide for students, educators, and an AI-augmented workforce. Stephen Atlas (Independently published). 2023.
  • 43.Ramesh D, Sanampudi SK. An automated essay scoring system: a systematic literature review. Artif Intell Rev. Published online September 23, 2021. 10.1007/s10462-021-10068-2. [DOI] [PMC free article] [PubMed]
  • 44.Somoye FL. Is Chat GPT free? In short - yes. PC Guide. Published February 24, 2023. https://www.pcguide.com/apps/chat-gpt-free/. Accessed 24 Feb 2024
