Introduction
Assessment remains a major challenge to high-quality implementation of a more competency-based graduate medical education (GME) system.1 Fortunately, the past decade has seen significant progress. In the United States, the Accreditation Council for Graduate Medical Education (ACGME) has developed the Milestones to help define the specific outcomes, or competencies, that medical trainees are expected to develop during their training.2,3 However, training programs still lack standardized, evidence-based approaches to assess performance, use assessment data to make competency decisions, and implement assessment systems successfully. As a result, assessment processes vary substantially among residency programs,4 including within Clinical Competency Committees.5 This variation can waste educational resources, limit educational impact, compromise educational quality, and ultimately threaten ongoing efforts to make GME more competency-based.
To improve the quality of competency-based assessment, an important next step is to examine what assessment tools are used in GME and how data from those tools are used to determine trainee competency. The ACGME Common Program Requirements specify that resident evaluation occurs at specific intervals, including determination of Milestone ratings.6 The ACGME has guidelines for what assessment approaches can be used to inform competency decisions related to the Milestones,7 and other organizations provide lists of assessment tools for specific activities.8,9 However, outside of end-of-rotation evaluations and in-training examinations, few assessments are mandated for medical trainees, and no specific assessment tool is required or recommended by the ACGME. Furthermore, many assessment tools are developed in single institutions and may not be feasible to scale outside the contexts in which they were developed. Even when an assessment tool is implemented more widely, it remains difficult to connect the results of an assessment tool to proficiency standards like the Milestones. Understanding how assessment data have been collected, aggregated, and used to inform decisions about trainee competence is key to understanding progress toward a more evidence-based medical education system.
This report characterizes what assessment tools have been used to determine clinical competency in GME and identifies assessment systems that have demonstrated feasibility in generating valid and fair scores. Using the Evidence Centered Design (ECD) framework,10,11 we examined what assessment tools are being used in GME, what competencies they are being used to assess, and what evidence exists to support their use and implementation.
Our Approach
We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework12 to develop a search strategy for identifying all assessment tools used to inform competency decisions in GME. The protocol is available on the Open Science Framework. The PubMed, Scopus, ERIC, and Embase databases were searched in May 2024 to retrieve articles. Studies were included if they used an assessment tool to measure technical, procedural, operational, or clinical competency within any specialty in GME. Studies were excluded if they were conducted in health professions education outside of medicine or if they were non-English texts, editorial or perspective pieces, or review articles.
A total of 2299 articles were examined during title and abstract screening. Two reviewers (K.M.M. and A.P.) reviewed each title and abstract; 642 studies received a full-text review by both reviewers, and 454 studies were selected. Conflicts regarding whether a study met criteria were resolved through discussion until the 2 reviewers reached consensus.
Data elements collected included study characteristics, competencies assessed, details of the assessment tool, scoring methods, implications, and validity evidence for each tool. These articles and the manually extracted data were then used to train and iteratively refine a large language model (LLM), ChatGPT 4o, which was used to extract specific data elements from the articles. For any data element that required a decision or interpretation, such as the ACGME competency domain assessed by the tool, the LLM was trained to provide a justification for its extraction decision; this strategy was implemented to facilitate data quality assurance. All elements were manually checked by reviewers. All studies were assessed for quality using the Medical Education Research Study Quality Instrument (MERSQI).13,14
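Because the exact prompts used in this process are not published here, the sketch below is only a hypothetical illustration, written in R (the language used for our analyses), of how a structured extraction prompt with a justification field might be organized; every field name is an assumption.

```r
# Hypothetical sketch of a structured extraction prompt; the actual prompts,
# field names, and wording used in this study are not reproduced here.
extraction_prompt <- paste(
  "You are extracting data from a medical education study.",
  "Return a JSON object with the following fields:",
  '  "assessment_tool": the name and type of the assessment tool,',
  '  "acgme_competency": the ACGME core competency domain(s) assessed,',
  '  "competency_justification": 1-2 sentences justifying that choice,',
  '  "scoring_method": how performance was scored,',
  '  "validity_evidence": any validity evidence reported.',
  "If a field is not reported in the article, return null rather than guessing.",
  sep = "\n"
)
cat(extraction_prompt)
```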
Data were organized according to ECD, a framework for building and implementing quality assessment systems.10,11,15,16 Within ECD, the Conceptual Assessment Framework16 outlines specific components of an assessment argument: the Competency Model, the Task Model, and the Evidence Model. The Competency Model specifies what knowledge, skills, or attributes should be assessed; for this report, competencies were categorized based on the 6 core competencies outlined in the ACGME Milestones. The Task Model describes the tasks being performed and the elements of the task that elicit a desired construct, such as the setting and the rater performing the assessment. The Evidence Model describes how a learner's performance on an assessment is scored. Assessment tools were categorized as follows: simulation, workplace-based assessment (WBA),17 checklist, written examination, Milestones, biometric data, patient outcomes, and clinical competency committees. Additional extracted data included the types of decisions made from the assessment data and study quality elements from the MERSQI framework (study response rate, data type, outcome type, complexity of analysis, and validity evidence). After extracting data, descriptive analyses were conducted using R (version 4.2.2024).
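For illustration, the descriptive analyses took roughly the following form; the file name and column names below are hypothetical stand-ins for the extracted data set.

```r
# Minimal sketch of the descriptive analyses, assuming a hypothetical
# extracted data set with one row per study.
library(dplyr)

studies <- read.csv("extracted_studies.csv")  # hypothetical file name

# Frequency of each assessment tool category (simulation, WBA, checklist, ...)
studies %>%
  count(tool_type, sort = TRUE)

# Median (IQR) MERSQI score within each tool category
studies %>%
  group_by(tool_type) %>%
  summarise(
    median_mersqi = median(mersqi_score, na.rm = TRUE),
    iqr_low = quantile(mersqi_score, 0.25, na.rm = TRUE),
    iqr_high = quantile(mersqi_score, 0.75, na.rm = TRUE)
  )
```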
Evidence for Competency Assessment Tools in GME
A total of 454 studies were examined. Of these, 282 were conducted in the United States, 156 outside the United States, and 16 across multiple countries. Publication dates ranged from 1989 to 2024. Study designs were categorized using the MERSQI categories for educational studies (Table 1); most studies were cross-sectional, posttest, or pretest/posttest designs. The 5 most common specialties represented were general surgery, internal medicine, pediatrics, emergency medicine, and anesthesiology. Fifty-three studies were conducted in more than one specialty, and 147 studies occurred within one specialty at multiple institutions. The median number of participants assessed was 34 (range, 1 to 21 499).
Table 1.
Study Characteristics of Included Studies
| Variable | n |
|---|---|
| Country | |
| United States | 282 |
| International | 156 |
| Multiple countries | 16 |
| No. of institutions within study | |
| 1 | 278 |
| 2 | 30 |
| More than 2 | 146 |
| Study design | |
| Single-group cross-sectional or single-group posttest | 240 |
| Single-group pretest and posttest | 179 |
| Nonrandomized, 2 groups | 8 |
| Randomized controlled trial | 27 |
The ACGME competency domain most assessed was Patient Care (Table 2); 221 studies evaluated more than one competency. Overall, the most common types of assessment tools were simulation tools, workplace-based assessments, checklists, and written examinations (Table 3). In studies that evaluated a single ACGME competency, simulation was the most commonly used assessment tool. For studies that assessed multiple competencies, the Milestones were the most common tool for aggregating assessment data. The primary rater, or person conducting the assessment, was a faculty member or attending physician in 424 studies, another resident or fellow in 17 studies, a program director in 29 studies, a nurse or other health care provider in 9 studies, and another rater, such as a parent or standardized patient, in 24 studies.
Table 2.
ACGME Competencies Examined in Articles With Most Commonly Associated Assessment Tool and Validity Evidence
| ACGME Competency | N | Assessment Tools | N | Validity Evidence | n (%) |
|---|---|---|---|---|---|
| Patient Care | 311 | Simulation | 153 | Internal structure | 121 (79) |
| | | | | Content validity | 128 (84) |
| | | | | Relationship to other variables | 109 (71) |
| | | Workplace-based assessment | 120 | Internal structure | 87 (73) |
| | | | | Content validity | 89 (74) |
| | | | | Relationship to other variables | 80 (67) |
| | | Checklist | 62 | Internal structure | 54 (87) |
| | | | | Content validity | 55 (89) |
| | | | | Relationship to other variables | 53 (85) |
| Medical Knowledge | 130 | Simulation | 58 | Internal structure | 50 (86) |
| | | | | Content validity | 51 (88) |
| | | | | Relationship to other variables | 43 (74) |
| | | Written assessment | 44 | Internal structure | 31 (70) |
| | | | | Content validity | 33 (75) |
| | | | | Relationship to other variables | 28 (64) |
| | | Checklist | 32 | Internal structure | 28 (88) |
| | | | | Content validity | 29 (91) |
| | | | | Relationship to other variables | 28 (88) |
| Professionalism | 55 | Simulation | 24 | Internal structure | 18 (75) |
| | | | | Content validity | 19 (79) |
| | | | | Relationship to other variables | 17 (71) |
| | | Workplace-based assessment | 19 | Internal structure | 17 (89) |
| | | | | Content validity | 18 (95) |
| | | | | Relationship to other variables | 16 (84) |
| | | Checklist | 12 | Internal structure | 9 (75) |
| | | | | Content validity | 10 (83) |
| | | | | Relationship to other variables | 8 (67) |
| Interpersonal and Communication Skills | 73 | Simulation | 39 | Internal structure | 28 (72) |
| | | | | Content validity | 29 (74) |
| | | | | Relationship to other variables | 28 (72) |
| | | Workplace-based assessment | 23 | Internal structure | 17 (74) |
| | | | | Content validity | 19 (83) |
| | | | | Relationship to other variables | 16 (70) |
| | | Checklist | 18 | Internal structure | 12 (67) |
| | | | | Content validity | 14 (78) |
| | | | | Relationship to other variables | 11 (61) |
| Practice-Based Learning and Improvement | 52 | Simulation | 21 | Internal structure | 19 (90) |
| | | | | Content validity | 19 (90) |
| | | | | Relationship to other variables | 15 (71) |
| | | Workplace-based assessment | 17 | Internal structure | 15 (88) |
| | | | | Content validity | 15 (88) |
| | | | | Relationship to other variables | 15 (88) |
| | | Written assessment | 11 | Internal structure | 8 (73) |
| | | | | Content validity | 9 (82) |
| | | | | Relationship to other variables | 5 (45) |
| Systems-Based Practice | 36 | Simulation | 16 | Internal structure | 12 (75) |
| | | | | Content validity | 13 (81) |
| | | | | Relationship to other variables | 11 (69) |
| | | Checklist | 11 | Internal structure | 10 (91) |
| | | | | Content validity | 11 (100) |
| | | | | Relationship to other variables | 10 (91) |
| | | Workplace-based assessment | 10 | Internal structure | 9 (90) |
| | | | | Content validity | 9 (90) |
| | | | | Relationship to other variables | 7 (70) |
| All competencies assessed together | 79 | Milestones | 23 | Internal structure | 13 (57) |
| | | | | Content validity | 17 (74) |
| | | | | Relationship to other variables | 13 (57) |
| | | Workplace-based assessment | 20 | Internal structure | 14 (70) |
| | | | | Content validity | 12 (60) |
| | | | | Relationship to other variables | 13 (65) |
| | | Clinical Competency Committee | 19 | Internal structure | 13 (68) |
| | | | | Content validity | 15 (79) |
| | | | | Relationship to other variables | 14 (74) |
Abbreviation: ACGME, Accreditation Council for Graduate Medical Education.
Table 3.
Assessment Tool Types Used in Included Studies
| Assessment Tool Type | Definition | N | MERSQI Score, Median (IQR) |
|---|---|---|---|
| Simulation | Assessment tools that aim to replicate or mirror real-life clinical situations, such as patients, anatomical regions, or clinical tasks.35,36 | 171 | 12.50 (11.50-13.00) |
| Workplace-based assessment | Assessment tools that collect information about performance in an authentic working environment.37 | 145 | 12.00 (11.00-13.00) |
| Checklist | Assessment tools that list the specific criteria for the skills, behaviors, or attitudes that participants should demonstrate to show successful learning from training.38 | 70 | 12.00 (11.12-13.00) |
| Written examination | Assessment tools that test performance through written, computerized, or oral examinations.39,40 | 59 | 12.00 (11.00-13.00) |
| Milestones | Assessments of competency-based developmental outcomes (eg, knowledge, skills, attitudes, and performance) that can be demonstrated progressively by residents/fellows from the beginning of their education through graduation to the unsupervised practice of their specialties.2 | 28 | 11.00 (9.75-11.75) |
| Clinical Competency Committees | Assessment processes in which the Clinical Competency Committee uses multiple data sources to determine ACGME Milestone ratings for trainees. | 22 | 11.75 (11.12-12.50) |
| Patient outcomes | Assessment tools that use patient data or outcomes as sources of assessment for residents. | 16 | 13.00 (12.25-14.12) |
| Biometric data | Assessment tools that use biometric data, such as motion tracking, as measures of proficiency. | 6 | 12.50 (12.12-12.88) |

Abbreviations: MERSQI, Medical Education Research Study Quality Instrument; ACGME, Accreditation Council for Graduate Medical Education.
Most studies used basic summative or average scoring methods to produce a score for a single assessment. Few studies discussed scoring methods that combined multiple assessments over time; one such method described in the articles was cumulative sum control chart analysis (n=29). A total of 211 studies discussed the feasibility of using their assessment tool; however, most discussed only local implementation. A total of 33 studies included a generalizability study related to their assessment tool. Many studies discussed some type of validity evidence for their assessment tool (Table 2). As evidence of internal structure, 109 studies reported Cronbach's alpha, 50 reported intraclass correlation coefficients, and 9 reported kappa statistics. For content validity, the most cited evidence was expert consensus (n=152) and alignment with established competencies or guidelines (n=118).
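To illustrate cumulative sum (CUSUM) control chart analysis in its simplest binary form, the sketch below plots a trainee's running score across sequential attempts; the acceptable failure rate and decision limit shown are hypothetical and would need to be derived from published CUSUM parameters for a given procedure.

```r
# Illustrative binary CUSUM for sequential procedural outcomes.
# p0 is a hypothetical acceptable failure rate; the decision limit is
# likewise hypothetical and would be set from published CUSUM tables.
cusum_failures <- function(failures, p0 = 0.10) {
  # failures: vector of 0 (success) / 1 (failure) in chronological order
  cumsum(failures - p0)
}

set.seed(1)
outcomes <- rbinom(40, 1, 0.08)  # simulated attempts with an 8% failure rate
scores <- cusum_failures(outcomes)

plot(scores, type = "s", xlab = "Attempt number", ylab = "CUSUM score")
abline(h = 2, lty = 2)  # hypothetical decision limit; crossing it signals concern
```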
No studies reported specific methodologies or techniques for translating assessment data from individual assessment tools into determinations of ACGME competencies. A total of 12 studies examined bias in assessment tools or assessment outcomes. The median MERSQI score for included studies was 12 (IQR 11-13), with scores ranging from 6 to 15.5.
Our Interpretations and Recommendations for GME
This examination found that WBAs, simulation, and checklists were the most common assessment tools used in GME. Most studies examined an assessment tool in one specialty and at a single institution. Across studies evaluating a single competency, simulation was the most commonly used assessment tool; when assessing multiple competencies in a single study, Milestones were the most common tool for aggregating assessment data. Many studies reported some validity evidence for an assessment tool, but the reported evidence was highly variable.
WBAs assess trainees in their authentic work environment, providing key insights into a trainee's ability to perform at a specific point in time and facilitating both assessment for learning and assessment of learning.18 Simulations, by contrast, are performed in a controlled environment that can be directly modified to elicit specific knowledge and abilities, and they may have consistent raters and rubrics. Simulation-based assessments can be useful for ensuring that a trainee masters a specific skill prior to interacting with patients; in addition, consistency in raters and rubrics may facilitate higher levels of assessment reliability. Finally, checklists were commonly used to document that a trainee can appropriately follow the steps of a given task. Checklists emphasize not only the outcome of an activity but also the way the activity was carried out, both of which are critical for characterizing a trainee's competence.19
Patient Care and Medical Knowledge were the most commonly assessed ACGME competencies, likely because they are easier to assess with commonly used assessment tools, such as WBAs, simulations, checklists, and traditional testing formats. Some of the tools used to assess Patient Care, such as the Society for Improving Medical Professional Learning (SIMPL) WBA and objective structured clinical examinations, have been broadly implemented across different institutions, specialties, and assessment settings.20-25 Written or multiple-choice examinations continue to be a popular method for assessing Medical Knowledge. Other competencies, such as Interpersonal and Communication Skills, Professionalism, Systems-Based Practice, and Practice-Based Learning and Improvement, appear to be more difficult to observe in a way that can be reliably assessed by a rater.26-28 Innovative tools, such as portfolios, have shown promising results for authentically assessing trainees in these competencies; however, they can be time-intensive to implement.29,30
Most studies examined assessment tools within a single medical specialty. This finding is notable given that the assessment of competencies such as Interpersonal and Communication Skills, Professionalism, Systems-Based Practice, and Practice-Based Learning and Improvement is likely generalizable. Although there is significant variation in training environments and health care systems, one of the intended goals of Milestones 2.0 is to create harmonized language that is generalizable across different specialties and systems.28 The potential for these competencies to manifest in similar ways across specialties could serve as a motivator to develop robust assessment systems through shared resources and lessons learned.
Our examination demonstrated that a substantial number of different assessment tools are used in GME. The ingenuity evidenced by these different approaches has helped expand the ways in which assessment evidence is collected. While the continued development of assessment systems is to be commended, the diversity of data collection approaches could be better aligned with the specific decisions a user could make based on a learner's performance, quantified as one or more "scores." Assessment is an inherently practical activity31; therefore, unless it is being done purely for research purposes, assessment data should be collected to support formative feedback or specific acceleration, remediation, and certification decisions.
While several studies highlighted the possibility of using one or more observations to support a Milestone subcompetency rating, few, if any, provided rigorous quantitative rules for doing so. For example, WBAs were argued to be helpful for assessing trainees in their authentic work environment; however, left unaddressed is how different WBA ratings could be aligned to a subcompetency level (eg, PC2 Level 3). Aligning assessment evidence to a subcompetency requires specific mappings between the proficiencies thought to drive observable performance, flexible analytical models, and explicit rules for observing multiple types of activities that cover elements of the desired subcompetency.32 Aligning assessment scores to subcompetency levels could not only improve existing assessments but also highlight competencies with underdeveloped sources of evidence (eg, Professionalism). Similarly, understanding how to align specialty-specific entrustable professional activities (EPAs) to Milestones could help improve assessment quality and the interpretation of the growing number of EPA evaluations. Additionally, more focused research on the measurement models used to aggregate observable and varied assessment evidence will be critical to achieving high-quality programmatic assessment, as proposed in the modern competency-based medical education (CBME) model, which relies on multiple data points contributed by a variety of assessors and tools.33
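To make concrete what an explicit quantitative rule could look like, the sketch below maps a set of WBA entrustment ratings to a provisional Milestone level. No included study reported such a rule; the rating scale, minimum number of observations, and cut points are all hypothetical.

```r
# Hypothetical rule for aligning WBA ratings to a Milestone subcompetency level.
# Nothing here reflects an actual published mapping.
wba_to_milestone <- function(ratings, min_obs = 5) {
  # ratings: WBA entrustment ratings on an assumed 1-4 scale
  if (length(ratings) < min_obs) return(NA_integer_)  # insufficient evidence
  avg <- mean(ratings)
  # Hypothetical cut points aligning mean entrustment to Milestone levels 1-5
  cuts <- c(-Inf, 1.5, 2.25, 3.0, 3.75, Inf)
  as.integer(cut(avg, breaks = cuts, labels = FALSE))
}

wba_to_milestone(c(2, 3, 3, 3, 4, 3))  # returns a provisional level of 3
```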
Recommendations
Simulation and WBA tools should be strongly considered when seeking to collect performance data related to Patient Care competencies. While WBAs are already feasible, additional investment is needed to develop approaches across the entire range of Patient Care activities, many of which are specialty specific. Approaches to interpreting the data collected with simulation and WBAs (including specialty-specific EPA evaluations) remain underdeveloped, especially for interpreting data aggregated from multiple assessed performances and for statistically linking observable performances to estimates of competencies.
Additional work is needed to develop, study, and support assessment of the Interpersonal and Communication Skills, Professionalism, Systems-Based Practice, and Practice-Based Learning and Improvement competencies. Assessment of these competencies may be generalizable across specialties. Where possible, new assessment tools should be developed and implemented while attending to issues of how to aggregate multiple observations and align overall scores to Milestone levels.
More studies should report sources of bias and possible strategies for mitigating it. Many studies have examined bias in assessment broadly, but few original research studies investigate or report sources of bias for specific assessment tools. Further investigation into the bias that can arise when developing or implementing assessment tools would help promote fair assessment practices and, ultimately, better assessment of trainees.34
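As one minimal illustration, a program could begin by testing whether ratings from a specific tool differ across trainee groups after adjusting for training year. The variable names below are hypothetical, and rigorous work would require richer models, such as the differential algorithmic functioning framework in reference 34.

```r
# Hypothetical sketch of a first-pass bias check for a single assessment tool:
# do ratings differ by a trainee characteristic once training year is held fixed?
wba <- read.csv("wba_ratings.csv")  # hypothetical data set, one row per rating
fit <- lm(rating ~ pgy_year + trainee_group, data = wba)
summary(fit)  # a substantial trainee_group coefficient flags possible bias
```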
New tools should be scaled only after strong validity evidence has been collected, the way assessment data should be used to determine competency has been clearly articulated, and rigorous efforts have been made to optimize implementation-related factors, including characterizing which contexts are best suited to a specific assessment tool. This early-stage work should be performed in small pilots that include a representative sample of training programs.
The work needed to develop, test, and implement a new assessment should be performed in partnership with consortia of educators, researchers, training programs, and policymakers. This approach improves quality, ensures feasibility, and provides cost-effective opportunities for more rapid innovation.
Caveats and Considerations
There is significant heterogeneity in current study designs, populations, interventions, outcomes, and measurement methods. This variation complicates data synthesis and makes comparisons difficult. We did not examine data pertaining to the type of training environment (eg, academic medical center, rural setting) in which the assessments took place. Publication bias may have skewed the included studies toward positive findings. Additionally, because we trained and used an LLM to perform the initial data extraction, it is possible that some data were missed or that the LLM introduced bias.
Conclusions
Assessment in GME is heterogeneous. While many assessment tools have been developed, most studies do not describe how to use the tool to inform competency decisions. To support high-quality and fair assessment in GME, future research should investigate how assessment outcomes can be used to inform competency decisions.
Acknowledgments
The authors would like to thank Emily Capellari, a librarian at the University of Michigan, who was instrumental in designing the search strategy for this report.
Author Note
*Drs Marcotte and Pradarelli are co-first authors and contributed equally to the work.
Editor’s Note
The ACGME News and Views section of JGME includes reports, initiatives, and perspectives from the ACGME and its review committees. This article was not reviewed through the formal JGME peer review process. The decision to publish this article was made by the ACGME. This is an article commissioned by the ACGME to inform the major revision of the Common Program Requirements, a 3-year process currently underway.
References
1. Gates RS, Marcotte K, Moreci R, et al. An ideal system of assessment to support competency-based graduate medical education: key attributes and proposed next steps. J Surg Educ. 2024;81(2):172-177. doi:10.1016/j.jsurg.2023.10.006
2. Edgar L, McLean S, Hogan SO, Hamstra S, Holmboe ES. The Milestones Guidebook. Accreditation Council for Graduate Medical Education. Published 2020. Accessed February 6, 2026. https://www.acgme.org/Portals/0/MilestonesGuidebook.pdf
3. Accreditation Council for Graduate Medical Education. Frequently asked questions: Milestones. Accessed February 6, 2026. http://www.acgme.org/portals/0/milestonesfaq.pdf
4. Luckoski J, Jean D, Thelen A, Mazer L, George B, Kendrick DE. How do programs measure resident performance? A multi-institutional inventory of general surgery assessments. J Surg Educ. 2021;78(6):e189-e195. doi:10.1016/j.jsurg.2021.08.024
5. Ekpenyong A, Edgar L, Wilkerson L, Holmboe ES. A multispecialty ethnographic study of clinical competency committees (CCCs). Med Teach. 2022;44(11):1228-1236. doi:10.1080/0142159X.2022.2072281
6. Accreditation Council for Graduate Medical Education. Common Program Requirements (Residency). Accessed February 6, 2026. https://www.acgme.org/globalassets/pfassets/programrequirements/2025-reformatted-requirements/cprresidency_2025_reformatted.pdf
7. Holmboe ES, Iobst WF. Assessment Guidebook. Accreditation Council for Graduate Medical Education. Accessed February 6, 2026. https://www.acgme.org/globalassets/pdfs/milestones/guidebooks/assessmentguidebook.pdf
8. Mitchell EL, Arora S, Moneta GL, et al. A systematic review of assessment of skill acquisition and operative competency in vascular surgical training. J Vasc Surg. 2014;59(5):1440-1455. doi:10.1016/j.jvs.2014.02.018
9. Watanabe Y, Bilgic E, Lebedeva E, et al. A systematic review of performance assessment tools for laparoscopic cholecystectomy. Surg Endosc. 2016;30(3):832-844. doi:10.1007/s00464-015-4285-8
10. Arieli-Attali M, Ward S, Thomas J, Deonovic B, von Davier AA. The expanded evidence-centered design (e-ECD) for learning and assessment systems: a framework for incorporating learning goals and processes within assessment design. Front Psychol. 2019;10:853. doi:10.3389/fpsyg.2019.00853
11. Zieky MJ. An introduction to the use of evidence-centered design in test development. Psicología Educativa. 2014;20(2):79-87. doi:10.1016/j.pse.2014.11.003
12. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi:10.1136/bmj.n71
13. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002-1009. doi:10.1001/jama.298.9.1002
14. Reed DA, Beckman TJ, Wright SM, Levine RB, Kern DE, Cook DA. Predictive validity evidence for medical education research study quality instrument scores: quality of submissions to JGIM's medical education special issue. J Gen Intern Med. 2008;23(7):903-907. doi:10.1007/s11606-008-0664-3
15. Shute V, Kim YJ, Razzouk R. Evidence Centered Design for Dummies. Accessed February 6, 2026. https://myweb.fsu.edu/vshute/ECD.pdf
16. Mislevy RJ, Almond RG, Lukas JF. A brief introduction to evidence-centered design. ETS Research Report Series. 2003;2003(1):i-29. doi:10.1002/j.2333-8504.2003.tb01908.x
17. The National Archives. Workplace Based Assessment. Published 2005. Accessed February 6, 2026. https://webarchive.nationalarchives.gov.uk/ukgwa/20091211203900/http://www.pmetb.org.uk/media/pdf/3/b/PMETB_workplace_based_assemment_paper_(2005).pdf
18. Lockyer J, Carraccio C, Chan MK, et al. Core principles of assessment in competency-based medical education. Med Teach. 2017;39(6):609-616. doi:10.1080/0142159X.2017.1315082
19. Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Res. 1994;23(2):13. doi:10.3102/0013189X023002013
20. Haas MRC, Davis MG, Harvey CE, et al. Implementation of the SIMPL (Society for Improving Medical Professional Learning) performance assessment tool in the emergency department: a pilot study. AEM Educ Train. 2023;7(1):e10842. doi:10.1002/aet2.10842
21. Bohnen JD, George BC, Williams RG, et al. The feasibility of real-time intraoperative performance assessment with SIMPL (System for Improving and Measuring Procedural Learning): early experience from a multi-institutional trial. J Surg Educ. 2016;73(6):e118-e130. doi:10.1016/j.jsurg.2016.08.010
22. Cox ML, Weaver ML, Johnson C, et al. Early findings and strategies for successful implementation of SIMPL workplace-based assessments within vascular surgery residency and fellowship programs. J Vasc Surg. 2023;78(3):806-814.e2. doi:10.1016/j.jvs.2023.04.039
23. Schuwirth LW, van der Vleuten CP. How 'testing' has become 'programmatic assessment for learning.' Health Prof Educ. 2019;5(3):177-184. doi:10.1016/j.hpe.2018.06.005
24. Norcini J, Anderson MB, Bollela V, et al. 2018 consensus framework for good assessment. Med Teach. 2018;40(11):1102-1109. doi:10.1080/0142159X.2018.1500016
25. Khan KZ, Ramachandran S, Gaunt K, Pushkar P. The objective structured clinical examination (OSCE): AMEE guide no. 81. Part I: an historical and theoretical perspective. Med Teach. 2013;35(9):e1437-e1446. doi:10.3109/0142159X.2013.818634
26. Guralnick S, Fondahn E, Amin A, Bittner EA. Systems-based practice: time to finally adopt the orphan competency. J Grad Med Educ. 2021;13(suppl 2):96-101. doi:10.4300/JGME-D-20-00839.1
27. Edgar L, Roberts S, Holmboe E. Milestones 2.0: a step forward. J Grad Med Educ. 2018;10(3):367-369. doi:10.4300/JGME-D-18-00372.1
28. Martinez J, Phillips E, Harris C. Where do we go from here? Moving from systems-based practice process measures to true competency via developmental milestones. Med Educ Online. 2014;19:24441. doi:10.3402/meo.v19.24441
29. O'Sullivan PS, Reckase MD, McClain T, Savidge MA, Clardy JA. Demonstration of portfolios to assess competency of residents. Adv Health Sci Educ Theory Pract. 2004;9(4):309-323. doi:10.1007/s10459-004-0885-0
30. McEwen LA, Griffiths J, Schultz K. Developing and successfully implementing a competency-based portfolio assessment system in a postgraduate family medicine residency program. Acad Med. 2015;90(11):1515-1526. doi:10.1097/ACM.0000000000000754
31. Briggs DC. Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies. Routledge; 2022.
32. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas Issues Pract. 1995;14(4):5-8. doi:10.1111/j.1745-3992.1995.tb00881.x
33. Van Melle E, Frank JR, Holmboe ES, et al. A core components framework for evaluating implementation of competency-based medical education programs. Acad Med. 2019;94(7):1002-1009. doi:10.1097/ACM.0000000000002743
34. Suk Y, Han KT. A psychometric framework for evaluating fairness in algorithmic decision making: differential algorithmic functioning. J Educ Behav Stat. 2024;49(2):151-172. doi:10.3102/10769986231171711
35. Issenberg B, Scalese R. Simulation in health care education. Perspect Biol Med. 2008;51(1):31-46. doi:10.1353/pbm.2008.0004
36. Ryall T, Judd BK, Gordon CJ. Simulation-based assessments in health professional education: a systematic review. J Multidiscip Healthc. 2016;9:69-82. doi:10.2147/JMDH.S92695
37. Liu C. An introduction to workplace-based assessments. Gastroenterol Hepatol Bed Bench. 2012;5(1):24-28.
38. Shayne P, Gallahue F, Rinnert S, Anderson C, Hern G, Katz E. Reliability of a core competency checklist assessment in the emergency department: the standardized direct observation assessment tool. Acad Emerg Med. 2006;13(7):727-732. doi:10.1197/j.aem.2006.01.030
39. Schuwirth L, van der Vleuten CP. ABC of learning and teaching in medicine: written assessment. BMJ. 2003;326(7390):643-645. doi:10.1136/bmj.326.7390.643
40. Swanson DB, Norcini JJ, Grosso LJ. Assessment of clinical competence: written and computer-based simulations. Assess Eval Higher Educ. 1987;12(3):220-246. doi:10.1080/0260293870120307
