Introduction
There is a current wave of enthusiasm about applying artificial intelligence (AI) to medical education. One potential application is surgical performance, where assessments are needed to identify performance gaps and areas for improvement, and ultimately to determine whether a surgeon is competent to perform work within his or her scope of practice [2, 13]. Using AI to perform surgical assessments could reduce interobserver bias among assessors and provide more specific observations about the surgeon being assessed. AI might also shorten the time or number of assessments a surgeon needs to demonstrate certain skills or competencies: When sufficient performance data demonstrate a statistical basis of association, advanced performance in one aspect of a procedure may obviate the need for further evaluation in other areas [11].
Therein lie two problems: the need for more data based on measurable criteria and the need for more precise assessment tools. AI models are programs or algorithms that use a specific set of data to recognize patterns and make predictions based on those data. Because those predictions are contingent on an enormous amount of contextually accurate data, the rigor of those data is critical for accurate and reliable AI-generated output. As we have written before [10], we still believe substantial improvements must be made to our assessment tools to provide formative assessments during training and summative high-stakes assessments, like board certification. Without the appropriate data models, AI performance assessments of surgeons will not meet the foundational requirements that assessment methods be accurate and fair to the surgeons being assessed and, ultimately, to the patients under their care.
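The pattern-recognition idea described above can be sketched in miniature. The following Python example trains a simple nearest-centroid classifier on interval-scaled procedure scores; the feature names, scores, and competency labels are hypothetical illustrations, not data from any assessment program.

```python
# Minimal sketch of the pattern-recognition idea behind an AI assessment
# model: a nearest-centroid classifier trained on hypothetical, interval-
# scaled procedure scores. All data and labels here are invented for
# illustration only.
from statistics import mean

def train_centroids(examples):
    """Average the feature vectors for each label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical scores: (exposure quality, reduction accuracy, implant placement)
training = [
    ((9.1, 8.7, 9.0), "competent"),
    ((8.8, 9.2, 8.9), "competent"),
    ((5.2, 4.9, 6.1), "not yet competent"),
    ((4.8, 5.5, 5.0), "not yet competent"),
]
centroids = train_centroids(training)
print(predict(centroids, (8.9, 9.0, 8.6)))  # close to the "competent" centroid
```

The point of the sketch is the dependency the text describes: the model's predictions are only as trustworthy as the labeled performance data it is given.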
How Do We Get There?
A current problem with many surgical assessments is that consistency, accuracy, and fairness are lacking in their implementation. For example, most surgical assessment tools are largely categorical in form: They simply indicate whether or not a learner demonstrates specific behaviors associated with a milestone instead of using precise measurement frameworks that are tied to explicit performance standards. With categorical and rank-order rating scales, assessors are more likely to assess candidates subjectively rather than against an explicit performance standard, even when evaluating skills for a similar surgical procedure [1, 6, 10].
Clear delineation of objective criteria for establishing standards of performance remains a problem for most aspects of orthopaedic surgery. AI will only work if these explicit performance standards are established by surgeons who are currently credentialed as experts in their given specialty scope of practice [2, 14]. For example, the surgical performance standards for intertrochanteric hip fracture fixation must be determined by objectively collecting performance assessment data from a sufficient number of certified orthopaedic surgeons, thereby establishing the criteria for the accepted standards for that procedure. This would allow a surgeon's performance to be compared against the accepted standards, providing more objective, accurate, and fair assessments derived from experts in the profession rather than from an arbitrary or imprecise data model [6].
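One simple way to turn pooled expert data into an explicit criterion can be sketched as follows. This Python example derives a cutoff from hypothetical expert scores and compares candidates against it; the scores and the one-standard-deviation rule are assumptions for illustration, not a validated standard-setting method.

```python
# Illustrative sketch of deriving an explicit performance standard from
# expert data: pool interval-scaled scores from certified surgeons, then
# judge a candidate against that empirically derived criterion rather than
# against an individual rater's opinion. Scores and the one-SD cutoff are
# hypothetical.
from statistics import mean, stdev

def performance_standard(expert_scores, k=1.0):
    """Criterion = expert mean minus k standard deviations."""
    return mean(expert_scores) - k * stdev(expert_scores)

def meets_standard(candidate_score, expert_scores, k=1.0):
    """Compare a candidate's score to the expert-derived criterion."""
    return candidate_score >= performance_standard(expert_scores, k)

# Hypothetical interval-scaled scores for intertrochanteric fracture fixation
experts = [92.0, 88.5, 90.1, 94.2, 89.7, 91.3, 87.9, 93.0]
print(meets_standard(90.0, experts))   # at or above the criterion -> True
print(meets_standard(80.0, experts))   # below the criterion -> False
```

Because the criterion is computed from the expert distribution itself, the same candidate score yields the same judgment regardless of who administers the assessment.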
There has been considerable progress in rectifying these challenges in trauma surgery, where checklists or global ratings of surgical skills have been found to consistently overrate a surgeon's performance when compared to assessment tools that capture performance against explicit standards using interval or criterion-based scoring [1, 5]. Additionally, recent work [3, 4, 7, 8] has established performance standards for critical trauma procedures, including orthopaedic trauma, along with a dataset of nearly 50,000 rigorously assessed measures for 48 surgical procedures. These data demonstrate psychometric integrity, including discriminant and predictive validity tied to actual trauma care during a mass casualty event [4]. Because these data were captured to measure procedural competencies using parametric scaling, they facilitate the inferential statistical analyses that are essential for fair and accurate assessment of competence in any professional area [5]. This work was achieved at the Uniformed Services University in partnership with the Military Health System Strategic Partnership American College of Surgeons and the ACS Committee on Trauma.
Assessment tools that are both objective and fair are essential for developing a valid and reliable dataset that informs potential AI assessment systems. We recommend that the American Board of Orthopaedic Surgeons and Accreditation Council for Graduate Medical Education sponsor the development of criterion-based assessment tools for surgical procedures that are commonly performed by any orthopaedic surgeon in their professional scope of practice, with associated procedural performance outcomes data reviewed as part of a surgeon’s initial board certification. Procedures that are performed commonly enough to justify the effort needed to develop these kinds of assessments might include carpal tunnel release, primary total hip or total knee arthroplasty, or cephalomedullary nailing for an intertrochanteric fracture [12]. These efforts would gather procedural performance data from a sufficient number of surgeons, establishing a distribution of surgical performance for these procedures [6, 9]. Additionally, these data could be used to determine critical steps within each procedure, thereby narrowing the scope of assessed performance items to those that are critical to perform accurately while maintaining the overall accuracy and fairness of the assessment outcomes [11].
Conclusion
Regardless of whether AI is used, developing accurate and fair competency-based assessments requires a sufficient breadth of data. Given the current wave of enthusiasm for applying AI technology to medicine, now is the time to initiate such efforts. Longitudinal efforts on a large scale require multiple centers and funding. With AI funding for large projects currently available, we think this is an excellent opportunity to invest in establishing criterion-based performance standards derived from orthopaedic surgeons themselves, which will advance our ability to provide more specific and objective performance assessments across the specialty.
Footnotes
A note from the Editor-in-Chief: We are pleased to offer the next installment of “CORR® Curriculum—Orthopaedic Education,” a quarterly column. The goal of this column is to focus on aspects of resident education. We welcome reader feedback on all of our columns and articles; please send your comments to eic@clinorthop.org.
Each author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.
All ICMJE Conflict of Interest Forms for the author and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The opinions or assertions contained herein are the private ones of the author/speaker and are not to be construed as official or reflecting the views of the Department of Defense, the Uniformed Services University of the Health Sciences or any other agency of the U.S. Government.
The opinions expressed are those of the writer, and do not reflect the opinion or policy of CORR® or The Association of Bone and Joint Surgeons®.
References
- 1. Anderson DD, Long S, Thomas GW, Putnam MD, Bechtold JE, Karam MD. Objective structured assessments of technical skills (OSATS) does not assess the quality of the surgical result effectively. Clin Orthop Relat Res. 2016;474:874-881.
- 2. Andreatta P, Smith CS, Graybill JC, Bowyer M, Elster EA. Challenges and opportunities for artificial intelligence in surgery. The Journal of Defense Modeling and Simulation. 2022;19:219-227.
- 3. Andreatta PB, Bowyer MW, Remick K, Knudson MM, Elster EA. Evidence-based surgical competency outcomes from the clinical readiness program. Ann Surg. 2023;277:e992-e999.
- 4. Andreatta PB, Bowyer MW, Renninger CH, Graybill JC, Gurney JM, Elster EA. Putting the ready in readiness: a post-hoc analysis of surgeon performance during a military MASCAL in Afghanistan. J Trauma Acute Care Surg. 2024;97:S119-S125.
- 5. Andreatta PB, Renninger CH, Bowyer MW, Gurney JM. Measuring competency: improving the validity of your procedural performance assessments. Ann Surg. 2023;4:e346.
- 6. Bandalos DL. Measurement Theory and Applications for the Social Sciences. The Guilford Press; 2018.
- 7. Bowyer MW, Andreatta PB, Armstrong JH, Remick KN, Elster EA. A novel paradigm for surgical skills training and assessment of competency. JAMA Surg. 2021;156:1103-1109.
- 8. Bradley MJ, Franklin BR, Renninger CH, Graybill JC, Bowyer MW, Andreatta PB. Upper-extremity vascular exposures for trauma: comparative performance outcomes for general surgeons and orthopedic surgeons. Mil Med. 2023;188:e1395-e1400.
- 9. DeVellis RF, Thorpe CT. Scale Development: Theory and Applications (Applied Social Research Methods Series). 3rd ed. Sage Publications; 2011.
- 10. Dougherty P, Andreatta P. Competency based education-how do we get there? Clin Orthop Relat Res. 2017;475:1557-1560.
- 11. Gorsuch RL. Factor Analysis: Classic Edition. 2nd ed. Routledge; 2014.
- 12. Kellam JF, Archibald D, Barber JW, et al. The core competencies for general orthopaedic surgeons. J Bone Joint Surg Am. 2017;99:175-181.
- 13. Kirubarajan A, Young D, Khan S, Crasto N, Sobel M, Sussman D. Artificial intelligence and surgical education: a systematic scoping review of interventions. J Surg Educ. 2022;79:500-515.
- 14. Sarker IH. AI-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Comput Sci. 2022;3:158.
