Table 3.
Direct Observation Tool | Total Studies (Total Participants) | Assessment Setting | Accuracy and Reliability | Benefits | Limitations | Resource for Example Tool |
---|---|---|---|---|---|---|
CDOT | 1 (29) | Clinical | Poor inter‐rater reliability (ICC = −0.04 to 0.25) 31 | Focused on critical care interventions. Mapped to milestones. Includes a qualitative comments box. | Limited to yes, no, or N/A responses. Poor inter‐rater reliability. | Schott 2015 31 |
Checklists | 4 (219) | Clinical and Simulation | Statistically significant increase for each training level (0.52 levels per year; p < 0.001). 22 Good inter‐rater reliability (ICC = 0.81 to 0.86). 22 | Checklists are targeted to each clinical presentation. May include an area for qualitative feedback. If mapped to milestones, can also be used to evaluate milestones for ACGME. | Each checklist needs to be individually designed for each chief complaint. Primarily focused on specific presentations or aspects of care. Response options often limited to yes, no, or unclear. Qualitative comments vary by checklist. | FitzGerald 2012 21 Hart 2018 22 Paul 2018 30 |
Global Breaking Bad News Assessment Scale | 1 (10) | Clinical | Resident skill increased by 90% on subsequent encounter. 28 | Short and easy to complete. Study tool can be modified to include a qualitative comments box. 28 | Only assesses delivery of bad news. Responses limited to yes or no. | Schildmann 2012 78 |
Global Rating Scale | 3 (435) | Clinical and Simulation | Statistically significant increase for each training level (p < 0.05). 22 Good inter‐rater reliability for clinical management (ICC = 0.74 to 0.87) and communication (ICC = 0.80). 22 | Fewer questions. Faster to perform. Can be combined with other direct observation tools. | Relies heavily on gestalt. Less granular assessment of components. No qualitative comments. | Ander 2012 14 Hart 2018 22 |
Local EOS Evaluation | 2 (69) | Clinical | N/A | Can include assessment of technical skills and some nontechnical skills (e.g., professionalism, interpersonal skills). | Categorizations are general with limited specific examples. Not all tools have qualitative comments. | Hoonpongsimanont 2018 24 Jones 2016 48 |
McMAP | 6 (112) | Clinical | Data on accuracy not available. 12.7% variance between raters. 35 | Learner‐centered. Individual clinical assessments were mapped to the ACGME and CanMEDS Frameworks. Tool uses behaviorally anchored scales and includes mandatory written comments. | May have a higher learning curve associated with the 76 unique assessments within the tool. Some components may not be possible to observe depending upon the patients encountered. Learners may avoid cumbersome tasks or those in which they are weaker. Faculty may avoid certain components that are harder to evaluate. | Acai 2019 34 Chan 2015 36 Chan 2017 35 |
Milestones | 9 (911) | Clinical and Simulation | Statistically significant increase for each training level (0.52 levels per year; p < 0.001). 18 However, faculty may overestimate skills with milestones (92% milestone achievement regardless of training level). 19 Mean CCC score differed significantly from milestone scores (p < 0.001). 19 Poor inter‐rater reliability in one study (ICC = −0.04 to 0.019). 31 | Addresses a diverse array of technical and nontechnical skills. Already utilized for summative residency assessments that are collected by the ACGME. | Many of the milestones may not be applicable for a given patient or shift. Has a risk of grade inflation. 19 No qualitative comments. | ACGME Milestones 52 |
Mini‐CEX | 4 (596) | Clinical | Did not identify any underperformers that were not already identified by the Australian Resident Medical Officer Assessment Form. 47 | Includes assessment of technical skills and some nontechnical skills (e.g., professionalism, efficiency). Overall high satisfaction among both learners and assessors. 47 Includes a dedicated area for qualitative feedback (strengths and weaknesses). | Does not assess teaching, teamwork, or documentation. Focused on single patient encounters so unable to account for managing multiple patients. Some components may be skipped unless they are required for completion. 50 | Brazil 2012 47 |
Minicard | 1 (73) | Clinical | Minicard scores increased by 0.021 points per month of training (p < 0.001). 20 | Includes comments for each individual assessment item. Includes an action plan at the end. | Inclusion of trainee level in descriptors for scoring may bias results. | Donato 2015 20 |
O‐EDShOT | 1 (45) | Clinical | Statistically significant increase for each training level (p < 0.001). 37 38% variance noted in ratings between raters. 37 13 forms needed for 0.70 reliability. 37 33 forms needed for 0.80 reliability. 37 | Designed specifically for the ED setting with feedback from faculty and residents. Includes an area for qualitative feedback (strengths and weaknesses). Can be used regardless of treatment area (i.e., high, medium, low acuity). | Only evaluated in a single study. | Cheung 2019 37 |
OSCE | 7 (575) | Clinical and Simulation | OSCE was positively correlated with ED performance score (p < 0.001). 33 Comparing 20‐item OSCE with 40‐item OSCE revealed no difference in accuracy (85.6% vs. 84.5%). 42 Variation in ICC between studies (ICC = 0.43 to 0.92). 17 , 42 | Can assess a wide range of factors, including technical and nontechnical skills. Bullard modeled their tool after the ABEM oral board categories. 17 | OSCEs may vary between sites. OSCEs typically need to be individually designed for each presentation. | Bullard 2018 17 Paul 2018 30 Wallenstein 2015 33 |
QSAT | 5 (360) | Simulation | Statistically significant increase between PGY 1/2 and PGY 3–5 (p < 0.001). 38 , 40 Mean score increased by 10% for each training year (p < 0.01). 39 QSAT total score was moderately correlated with in‐training evaluation report score (r = 0.341; p < 0.01). 41 Moderate inter‐rater reliability (ICC = 0.56 to 0.89). 25 , 38 , 40 | Provides a framework that can be customized to each specific case. | Each QSAT would need to be individually designed for each presentation. Studies limited to the simulation environment. | Hall 2015 40 Hall 2017 41 Jong 2018 25 |
RAT | 1 (17) | Simulation | RAT was positively correlated with entrustment scores (r = 0.630; p < 0.01). 46 Moderate inter‐rater reliability (ICC = 0.585 to 0.653). 46 | Builds upon QSAT with entrustable professional activities targeted towards resuscitation management. Designed using a modified Delphi study with experts. | Only assesses resuscitation management. Limited data from a single study. | Weersink 2019 46 |
RIME | 1 (289) | Clinical | Positive correlation between RIME category and clinical evaluation score (r² = 0.40, p < 0.01). 14 Very weak correlation between RIME category and clinical examination score. 14 | Easy to use. Can be combined with other tools. | Only one study evaluated RIME in the ED. Limited assessment of professional competencies (e.g., work ethic, teamwork, humanistic qualities). | Ander 2012 14 |
SDOT | 1 (26) | Clinical | Attending physicians were 54.4% accurate and resident physicians were 49.6% accurate when compared with the criterion standard scoring. 26 | Includes assessment of technical skills and some nontechnical skills (e.g., professionalism, interpersonal skills). | Several components may not be applicable to some patient encounters. Does not include an option for qualitative comments. Lower accuracy compared with other tools. May be more time consuming than other direct observation tools. | Kane 2017 26 |
ACGME = Accreditation Council for Graduate Medical Education; CCC = clinical competency committee; CDOT = Critical Care Direct Observation Tool; EOS = end of shift; ICC = intraclass correlation; McMAP = McMaster Modular Assessment Program; Mini‐CEX = Mini‐Clinical Evaluation Exercise for Trainees; N/A = not available; O‐EDShOT = Ottawa ED Shift Observation Tool; RAT = Resuscitation Assessment Tool; RIME = Reporter, Interpreter, Manager, Educator; SDOT = Standardized Direct Observation Tool.