| Reference | Objective or question | Study design | Setting, population, and n | Intervention | Control | Assessments | Outcome | Comments/results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robertson et al. 2018 [15]; Canada; single centre | Does FOR training improve IRR for evaluations of knot-tying and simple suturing? (five-point GRS, modified OSATS GRS, visual analogue scale) | Randomized controlled (stratified block randomization) | Attending surgeons from multiple specialties (n=47); voluntary | Seven-minute FOR training video (n=24) | No training (n=23) | 10 videos of trainees performing simple suturing and instrument knot tying | IRR – intraclass correlation (ICC) type 2: GRS 0.61 (0.41–0.85) with no training vs. GRS 0.71 (0.52–0.89) with FOR training; mean scores on the three assessment tools compared using mixed-model analysis, showing no differences | Randomized; simple task; short training session; no statistical difference but trends toward higher ICC with RT |
| Robertson et al. 2020 [16]; Canada; single centre | Does RT affect the reliability and validity of four technical skill assessment tools? | Randomized controlled (stratified block randomization) | Attending surgeons from multiple specialties (n=47); voluntary | Seven-minute FOR training video (n=24) | No training (n=23) | 10 videos of trainees performing simple suturing and instrument knot tying, with additional assessments at a two-week interval | Trend towards higher reliability (Cronbach's alpha, IRR) and validity, but no statistically significant difference; the OSATS GRS appeared to be the preferred tool | Randomized; simple task; short training session; no statistical difference but trends toward improved assessment tools |
| Rogers et al. 2002 [13]; Canada; single centre | Does RT improve IRR for evaluations of medical students' knot tying? (seven-point GRS) | Randomized to RT or no training; randomization method not described | General surgeons (n=8); numbers in each group not given; voluntary | Rater error training (RET) video (four errors shown three times) | No training | 24 videos of real students performing a hand tie, rated immediately | IRR – Feldt's t-test on Cronbach's alpha; no effect (α=0.71 RT vs. α=0.80 control) | Small; simple task; short training; high IRR regardless of training |
| Spanager et al. 2013 [23]; Denmark; nine centres | Does RT improve reliability in rating surgeons' non-technical skills? (NOTSS; five-point GRS) | Cohort study – pre- and post-training | General surgeons; specialists and fellows (n=15); voluntary | Four-hour training workshop (FOR, RET, BOT, PDT) | None (pre- and post-training comparison) | Nine scripted videos of OR encounters, rated before and immediately after RT | IRR – Cronbach's alpha (α=0.96 and 0.97 pre; α=0.97 and 0.98 post); Pearson's correlation for construct validity (r=0.95) | Non-technical skills; voluntary/assigned participation; non-randomized; no effect |
| Cook et al. 2008 [22]; US; single centre | Effect of RT on IRR and accuracy of mini-CEX scores (internal medicine clinical exam; five-point GRS) | Randomized controlled (21 of 54 declined randomization) | Medicine faculty (n=31); voluntary | Half-day workshop: RET, PDT, BOT, FOR (FOR occupied more than half the workshop) (n=16) | Delayed/no training (n=15); training offered after the second rating | 16 scripted videos, rated four weeks after training | IRR – ICC with mixed linear model (ICC=0.40 pre and 0.43 post for RT; 0.43 pre and 0.53 post for control); regression showed no significant interaction between group and testing period (pre/post), p=0.88 | Randomized, though a high number declined randomization; one-month delay between training and rating; no effect |
| Weitz et al. 2014 [24]; Germany; single centre | Does RT improve the accuracy of assessment of physical examination skills? (five-point German grading code) | Randomized controlled (RT vs. no training) | Medical faculty (n=21) | 90-minute workshop with in-person and video instruction and discussion (n=11) | No training (n=10) | 242 students undergoing a 10-minute physical examination skills assessment with a standardized patient | Accuracy – concordance between faculty assessments and a reference rating derived from video-based reassessment of all 242 encounters (GRS and dimension evaluation); no effect of training on rating accuracy detected | Randomized; small sample size; no effect on accuracy |
| George et al. 2013 [14]; US; single centre | Determine which type of FOR training enables reliable and accurate assessment of surgeons using the Zwisch scale (four-point GRS) | Quasi-experimental; immersive vs. accelerated training; non-randomized (allocation depended on availability to attend workshops) | Surgical faculty; voluntary (n=44) | Immersive – group workshop: FOR with videos, practice testing, and discussion (n=34) | Accelerated – one-hour individual session: initial FOR definitions only (n=10) | 10 videos of real operations by staff and residents, rated immediately post-RT | Accuracy – proportion of correct responses (80.2% immersive vs. 88% accelerated); correlation accuracy – Spearman coefficient (0.90 immersive vs. 0.93 accelerated); rater bias – Cronbach's alpha (0.045 immersive vs. 0.049 accelerated) | No differences between the two types of training; cannot rule out underlying differences between the two groups; different scoring rubrics for the two groups |
| Noel et al. 1992 [11]; US; multicentre (12 centres) | Determine the accuracy of faculty evaluations of residents' clinical skills and whether a structured form and RT improve evaluations (structured form included a four-point GRS) | Quasi-experimental; open form vs. structured form vs. structured form with training (allocation depended on when participants could attend; session times for each condition were randomized at each site) | Internists who serve as clinical evaluators; voluntary (n=203 total; 146 in groups 2 and 3) | 15-minute video on BOT, RET, PDT (n=69) | Structured form, no training (n=77) | Two scripted videos of a resident performing a history and physical on a standardized patient; rated immediately | Accuracy scores (% correct) – no difference between RT and no training with the structured form (64% vs. 66% with RT for case 1; 63% vs. 64% with RT for case 2) | Only the structured form vs. structured form with training comparison was considered for this review; the structured form improved accuracy, and training gave small, non-significant improvements in some areas |
| Holmboe et al. 2004 [10]; US; multicentre (16 programs) | Evaluate the efficacy of direct observation of competence training to change rating behavior (nine-point mini-CEX GRS) | Cluster randomized controlled trial; stratified, sealed envelopes; rated pre- and post-training | Internist faculty, nominated by program director, then voluntary; n=40 randomized (three lost) | Four-day course; "direct observation of competence" on day 2, including FOR, PDT, and BOT (n=16) | No training (received the same information packet as the RT group) (n=21) | Nine scripted tapes (three cases at three levels of performance), rated eight months after training | IRR estimated with confidence intervals and range (more stringent ratings in the training group, with a smaller range); regression showed significantly lower ratings in the trained group | High quality; positive effect; faculty had to be active in teaching to be nominated; no differences in baseline ratings |
| Newble et al. 1980 [21]; Australia; single centre | Value of RT for the reliability of scores and examiner selection for a clinical exam (standardized checklist with scores) | Randomized controlled trial; rated pre- and post-training | Surgical and internist faculty (n=18; nine of each; selection method unclear) | Limited: 30-minute individual training, PDT (n=6); extensive: two-hour group training, PDT and FOR with practice (n=6) | No training (rated two months later) (n=6) | Five videos of real students with a standardized patient (students unaware the patient was standardized); rated a few days after RT | IRR – Kendall's coefficient (pre/post scores: 0.48/0.44 intensive, 0.57/0.63 limited, 0.71/0.70 control); Spearman correlation between groups (0.8–0.9) | No effect of more intensive training on reliability; the intensive-training group was the most inconsistent at baseline (no difference between internists and surgeons); the only measure that improved IRR was removing unreliable raters |
| Van der Vleuten et al. 1989 [17]; US; single centre | Does training increase the accuracy of assessments of clinical skills? (history and physical standardized checklists) | Randomized trial (three cohorts – doctors, medical students, and lay people) | Physicians (surgeons and family doctors), all prior examiners (n=22) | 1.5-hour FOR training with practice (n=11) | No training (n=11) | Four tapes (two tapes each of two cases) of real students with a standardized patient, rated immediately | Accuracy – % agreement with consensus scores (overall 82% for RT and 81% for control) | Only the attending cohort was considered; all were previous examiners; some mild improvements on individual cases, but no significance levels or statistical comparisons given |
| Ludbrook et al. 1971 [20]; Australia; single centre | Does examiner training decrease inter-examiner variability for a clinical skills exam? | Randomized trial – randomization method not stated (and not certain); students marked by pairs of examiners (both trained or both untrained) | Surgical faculty (n=16) | 2.5-hour FOR training with videos and practice (n=16) | No training (n=16) | Medical school class of 100 students (each student examined in two cases, each with one pair of examiners), marked one week post-RT | Correlation coefficients between marking pairs (within pairs: r=0.55 RT and r=0.49 control, p<0.01 for both; between pairs: r=0.11 RT and r=0.14 control, p>0.05 for both) | No effect on correlation between rater groups; all had previous examiner experience; different scoring rubrics for the two groups |
| Dudek et al. 2013 [19]; Canada; multicentre (four schools) | Does training improve the quality of ITERs for medicine residents? (CCERR form used to assess ITER quality) | Randomized trial of five types of training – varying combinations of CCERR scores and feedback | Physicians who supervise medical trainees (n=98; only 37 returned data at all required time points) | One CCERR score given (n=7); three scores given (n=6); one score plus feedback given (n=9); three scores plus feedback given (n=5) | No feedback (n=10) | Participants submitted whichever ITERs they completed; scores, with or without feedback, were returned every six months, three times | Mean CCERR scores – scores improved in the feedback groups, but the improvement was not significant | Outcome was ITER quality; only feedback and scores were given, with no true "rater training" workshop; low rate of complete data collection (37/98); significantly underpowered (power calculation required n=240) |
| Dudek et al. 2012 [18]; Canada; multicentre (three sites) | Effectiveness of a workshop at improving ITER quality | Uncontrolled pre-/post-training design | Physicians who supervise trainees and complete ITERs; voluntary (n=22) | Three-hour workshop (explaining what makes a good FITER, recognizing challenges) | None (pre-/post-workshop comparison) | ITERs of real clinical encounters, completed pre- and post-training | Mean CCERR scores (18.9 pre-training vs. 21.7 post-training, p=0.02); ANOVA showed no time-by-item interaction for CCERR items, so changes were consistent pre/post across all items | Outcome was ITER quality |
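
Several studies above report inter-rater reliability as Cronbach's alpha [13, 16, 23]. For readers unfamiliar with the coefficient, the following is a minimal illustrative sketch of the standard computation from a subjects-by-raters score matrix; it is a generic textbook formula, not code from any of the cited studies, and the function name and example ratings are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of per-rater variances / variance of total scores).

    `scores` is a 2-D array with rows = subjects (e.g., videos) and columns = raters or items.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of raters/items
    rater_vars = scores.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of subjects' summed scores
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

# Hypothetical example: five videos rated by three raters on a five-point GRS
ratings = [[3, 4, 3],
           [4, 4, 5],
           [2, 3, 2],
           [5, 5, 4],
           [3, 3, 3]]
print(round(cronbach_alpha(ratings), 2))  # approximately 0.90
```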