AEM Educ Train. 2020 Sep 4;5(3):e10519. doi: 10.1002/aet2.10519

Table 3.

Summary of the Data for Each Tool

| Direct Observation Tool | Total Studies (Total Participants) | Assessment Setting | Accuracy and Reliability | Benefits | Limitations | Resource for Example Tool |
|---|---|---|---|---|---|---|
| CDOT | 1 (29) | Clinical | Poor inter‐rater reliability (ICC = −0.04 to 0.25). 31 | Focused on critical care interventions. Mapped to milestones. Includes a qualitative comments box. | Limited to yes, no, or N/A responses. Poor inter‐rater reliability. | Schott 2015 31 |
| Checklists | 4 (219) | Clinical and Simulation | Statistically significant increase for each training level (0.52 levels per year; p < 0.001). 22 Good inter‐rater reliability (ICC = 0.81 to 0.86). 22 | Checklists are targeted to each clinical presentation. May include an area for qualitative feedback. If mapped to milestones, can also be used to evaluate milestones for the ACGME. | Each checklist needs to be individually designed for each chief complaint. Primarily focused on specific presentations or aspects of care. Response options often limited to yes, no, or unclear. Qualitative comments vary by checklist. | FitzGerald 2012 21; Hart 2018 22; Paul 2018 30 |
| Global Breaking Bad News Assessment Scale | 1 (10) | Clinical | Resident skill increased by 90% on subsequent encounter. 28 | Short and easy to complete. Study tool can be modified to include a qualitative comments box. 28 | Only assesses delivery of bad news. Responses limited to yes or no. | Schildmann 2012 78 |
| Global Rating Scale | 3 (435) | Clinical and Simulation | Statistically significant increase for each training level (p < 0.05). 22 Good inter‐rater reliability for clinical management (ICC = 0.74 to 0.87) and communication (ICC = 0.80). 22 | Fewer questions. Faster to perform. Can be combined with other direct observation tools. | Relies heavily on gestalt. Less granular assessment of components. No qualitative comments. | Ander 2012 14; Hart 2018 22 |
| Local EOS Evaluation | 2 (69) | Clinical | N/A | Can include assessment of technical skills and some nontechnical skills (e.g., professionalism, interpersonal skills). | Categorizations are general with limited specific examples. Not all tools have qualitative comments. | Hoonpongsimanont 2018 24; Jones 2016 48 |
| McMAP | 6 (112) | Clinical | Data on accuracy not available. 12.7% variance between raters. 35 | Learner‐centered. Individual clinical assessments were mapped to the ACGME and CanMEDS frameworks. Tool uses behaviorally anchored scales and includes mandatory written comments. | May have a higher learning curve associated with the 76 unique assessments within the tool. Some components may not be possible to observe depending upon the patients encountered. Learners may avoid cumbersome tasks or those that they are weaker in. Faculty may avoid certain components that are harder to evaluate. | Acai 2019 34; Chan 2015 36; Chan 2017 35 |
| Milestones | 9 (911) | Clinical and Simulation | Statistically significant increase for each training level (0.52 levels per year; p < 0.001). 18 However, faculty may overestimate skills with milestones (92% milestone achievement regardless of training level). 19 Mean CCC score differed significantly from milestone scores (p < 0.001). 19 Poor inter‐rater reliability in one study (ICC = −0.04 to 0.019). 31 | Addresses a diverse array of technical and nontechnical skills. Already utilized for summative residency assessments that are collected by the ACGME. | Many of the milestones may not be applicable for a given patient or shift. Has a risk of grade inflation. 19 No qualitative comments. | ACGME Milestones 52 |
| Mini‐CEX | 4 (596) | Clinical | Did not identify any underperformers that were not already identified by the Australian Resident Medical Officer Assessment Form. 47 | Includes assessment of technical skills and some nontechnical skills (e.g., professionalism, efficiency). Overall high satisfaction among both learners and assessors. 47 Includes a dedicated area for qualitative feedback (strengths and weaknesses). | Does not assess teaching, teamwork, or documentation. Focused on single patient encounters so unable to account for managing multiple patients. Some components may be skipped unless they are required for completion. 50 | Brazil 2012 47 |
| Minicard | 1 (73) | Clinical | Minicard scores increased by 0.021 points per month of training (p < 0.001). 20 | Includes comments for each individual assessment item. Includes an action plan at the end. | Inclusion of trainee level in descriptors for scoring may bias results. | Donato 2015 20 |
| O‐EDShOT | 1 (45) | Clinical | Statistically significant increase for each training level (p < 0.001). 37 38% variance noted in ratings between raters. 37 13 forms needed for 0.70 reliability. 37 33 forms needed for 0.80 reliability. 37 | Designed specifically for the ED setting with feedback from faculty and residents. Includes an area for qualitative feedback (strengths and weaknesses). Can be used regardless of treatment area (i.e., high, medium, or low acuity). | Only evaluated in a single study. | Cheung 2019 37 |
| OSCE | 7 (575) | Clinical and Simulation | OSCE was positively correlated with ED performance score (p < 0.001). 33 Comparing 20‐item OSCE with 40‐item OSCE revealed no difference in accuracy (85.6% vs. 84.5%). 42 Variation in ICC between studies (ICC = 0.43 to 0.92). 17, 42 | Can assess a wide range of factors, including technical and nontechnical skills. Bullard modeled their tool after the ABEM oral board categories. 17 | OSCEs may vary between sites. OSCEs typically need to be individually designed for each presentation. | Bullard 2018 17; Paul 2018 30; Wallenstein 2015 33 |
| QSAT | 5 (360) | Simulation | Statistically significant increase between PGY 1/2 and PGY 3–5 (p < 0.001). 38, 40 Mean score increased by 10% for each training year (p < 0.01). 39 QSAT total score was moderately correlated with in‐training evaluation report score (r = 0.341; p < 0.01). 41 Moderate inter‐rater reliability (ICC = 0.56 to 0.89). 25, 38, 40 | Provides a framework that can be customized to each specific case. | Each QSAT would need to be individually designed for each presentation. Studies limited to the simulation environment. | Hall 2015 40; Hall 2017 41; Jong 2018 25 |
| RAT | 1 (17) | Simulation | RAT was positively correlated with entrustment scores (r = 0.630; p < 0.01). 46 Moderate inter‐rater reliability (ICC = 0.585 to 0.653). 46 | Builds upon the QSAT with entrustable professional activities targeted toward resuscitation management. Designed using a modified Delphi study with experts. | Only assesses resuscitation management. Limited data from a single study. | Weersink 2019 46 |
| RIME | 1 (289) | Clinical | Positive correlation between RIME category and clinical evaluation score (r2 = 0.40, p < 0.01). 14 Very weak correlation between RIME category and clinical examination score. 14 | Easy to use. Can be combined with other tools. | Only one study evaluated RIME in the ED. Limited assessment of professional competencies (e.g., work ethic, teamwork, humanistic qualities). | Ander 2012 14 |
| SDOT | 1 (26) | Clinical | Attending physicians were 54.4% accurate and resident physicians were 49.6% accurate when compared with the criterion standard scoring. 26 | Includes assessment of technical skills and some nontechnical skills (e.g., professionalism, interpersonal skills). | Several components may not be applicable to some patient encounters. Does not include an option for qualitative comments. Lower accuracy compared with other tools. May be more time consuming than other direct observation tools. | Kane 2017 26 |

ACGME = Accreditation Council for Graduate Medical Education; CCC = clinical competency committee; CDOT = Critical Care Direct Observation Tool; EOS = end of shift; ICC = intraclass correlation; McMAP = McMaster Modular Assessment Program; Mini‐CEX = Mini‐Clinical Evaluation Exercise for Trainees; N/A = not available; O‐EDShOT = Ottawa ED Shift Observation Tool; RAT = Resuscitation Assessment Tool; RIME = Reporter, Interpreter, Manager, Educator; SDOT = Standardized Direct Observation Tool.