Cureus. 2020 Nov 6;12(11):e11363. doi: 10.7759/cureus.11363

Table 1. Summary of studies included in the rater training scoping review.

BOT - behavioral observation training; FOR - frame of reference training; PDT - performance dimension training; RET - rater error training; GRS - global rating scale; OSATS - Objective Structured Assessment of Technical Skills; RT - rater training; NOTSS - non-technical skills for surgeons; CEX - clinical evaluation exercise; CCERR - Completed Clinical Evaluation Report Rating; ITER - in-training evaluation report; FITER - final in-training evaluation report; RE - rater error; OR - operating room; ICC - intraclass correlation coefficient; IRR - inter-rater reliability

Each study summary below lists: reference; objective or question; study design; setting, population, and n; intervention; control; assessments; outcome; and comments/results.
Robertson et al. 2018 [15] (Canada, single centre). Objective: does FOR training improve IRR for evaluations of knot tying and simple suturing (five-point GRS, modified OSATS GRS, visual analogue scale)? Design: randomized controlled (stratified block randomization). Population: attending surgeons from multiple specialties (n=47); voluntary. Intervention: seven-minute FOR training video (n=24). Control: no training (n=23). Assessments: 10 videos of trainees performing simple suturing and instrument knot tying. Outcome: IRR by intraclass correlation (ICC type 2): GRS 0.61 (0.41-0.85) with no training vs. GRS 0.71 (0.52-0.89) with FOR training; mean scores on the three assessment tools, compared using mixed-model analysis, showed no differences. Comments: randomized; simple task; short training session; no statistically significant difference but trend toward higher ICC with RT.
Robertson et al. 2020 [16] (Canada, single centre). Objective: does RT affect the reliability and validity of four technical skill assessment tools? Design: randomized controlled (stratified block randomization). Population: attending surgeons from multiple specialties (n=47); voluntary. Intervention: seven-minute FOR training video (n=24). Control: no training (n=23). Assessments: 10 videos of trainees performing simple suturing and instrument knot tying, with additional assessments at a two-week interval. Outcome: trend towards higher reliability (Cronbach's alpha, IRR) and validity but no statistically significant difference; the OSATS GRS appears to be preferred. Comments: randomized; simple task; short training session; no statistically significant difference but trends toward improvement across the assessment tools.
Rogers et al. 2002 [13] (Canada, single centre). Objective: does RT improve IRR for evaluations of medical student knot tying (seven-point GRS)? Design: randomized to RT or no training; randomization method not described. Population: general surgeons (n=8); numbers in each group not given; voluntary. Intervention: rater error training (RET) video (four errors shown three times). Control: no training. Assessments: 24 videos of real students performing a hand tie, rated immediately. Outcome: IRR by Feldt's t-test on Cronbach's alpha; no effect (α=0.71 RT vs. α=0.80 control). Comments: small sample; simple task; short training; high IRR regardless of training.
Spanager et al. 2013 [23] (Denmark, nine centres). Objective: does RT improve reliability when rating surgeons' non-technical skills (NOTSS; five-point GRS)? Design: cohort study, pre- and post-training. Population: general surgeons, specialists and fellows (n=15); voluntary. Intervention: four-hour training workshop (FOR, RET, BOT, PDT). Control: pre- vs. post-training comparison. Assessments: nine scripted videos of OR encounters, rated before and immediately after RT. Outcome: IRR by Cronbach's alpha (α=0.96 and 0.97 pre vs. α=0.97 and 0.98 post); Pearson's correlation for construct validity (0.95). Comments: non-technical skills; voluntary/assigned participation; non-randomized; no effect.
Cook et al. 2008 [22] (US, single centre). Objective: effect of RT on IRR and accuracy of mini-CEX scores (internal medicine clinical exam; five-point GRS). Design: randomized controlled (21/54 declined randomization). Population: medicine faculty (n=31); voluntary. Intervention: half-day workshop covering RE, PDT, BOT, FOR (FOR for more than half the workshop) (n=16). Control: delayed/no training (n=15); training offered after the second rating. Assessments: 16 scripted videos, four weeks after training. Outcome: IRR by ICC with a mixed linear model (ICC = 0.40 pre and 0.43 post for RT; 0.43 pre and 0.53 post for control); logistic regression showed no significant interaction between group and testing period (pre/post), p=0.88. Comments: randomized, but a high number declined randomization; one-month delay between training and rating; no effect.
Weitz et al. 2014 [24] (Germany, single centre). Objective: does RT improve the accuracy of assessment of physical examination skills (five-point German grading code)? Design: randomized controlled, RT or no training. Population: medical faculty (n=21). Intervention: 90-minute workshop with in-person and video instruction plus discussion (n=11). Control: no training (n=10). Assessments: 242 students undergoing a 10-minute physical examination skills assessment with a standardized patient; reference rating by video-based reassessment of all 242 assessments using GRS and dimension evaluation. Outcome: concordance between reference rating and faculty assessments; no effect of training on rating accuracy detected. Comments: randomized; small sample size; no effect on accuracy.
George et al. 2013 [14] (US, single centre). Objective: determine which type of FOR training supports reliable and accurate assessment of surgeons using the Zwisch scale (four-point GRS). Design: quasi-experimental, immersive vs. accelerated training, non-randomized (allocation depended on availability to attend workshops). Population: surgical faculty; voluntary (n=44). Intervention: immersive group workshop, FOR with videos, practice testing, and discussion (n=34). Control: accelerated one-hour individual session, initial FOR definitions only (n=10). Assessments: 10 videos of real operations by staff and residents, rated immediately post-RT. Outcome: proportion of correct responses for accuracy (80.2% immersive vs. 88% accelerated); Spearman coefficient for correlation accuracy (0.90 immersive vs. 0.93 accelerated); Cronbach's for rater bias (0.045 immersive vs. 0.049 accelerated). Comments: no differences between the two types of training; cannot rule out underlying differences between the two groups; different scoring rubrics for the two groups.
Noel et al. 1992 [11] (US, 12 centres). Objective: determine the accuracy of faculty evaluations of residents' clinical skills and how a structured form and RT improve evaluations (structured form included a four-point GRS). Design: quasi-experimental; open form vs. structured form vs. structured form with training (allocation depended on when raters could attend; session times randomized at each site). Population: internists who serve as clinical evaluators; voluntary (n=203 total; 146 in groups 2 and 3). Intervention: 15-minute video on BOT, RET, PDT (n=69). Control: structured form, no training (n=77). Assessments: two scripted videos of a resident history and physical on a standardized patient; rated immediately. Outcome: accuracy scores (% correct); no difference between RT and no training with the structured form (64% vs. 66% with RT for case 1; 63% vs. 64% with RT for case 2). Comments: only the structured form vs. structured form with training comparison was considered for this review; the structured form improved accuracy, while training produced small, non-significant improvements in some areas.
Holmboe et al. 2004 [10] (US, multicentre, 16 programs). Objective: evaluate the efficacy of direct observation of competence training in changing rating behavior (nine-point mini-CEX GRS). Design: cluster randomized controlled trial; stratified, sealed envelopes; rated pre- and post-training. Population: internist faculty, nominated by program directors and then recruited voluntarily (n=40 randomized, three lost). Intervention: four-day course with "direct observation of competence" on day 2, including FOR, PDT and BOT (n=16). Control: no training (received the same information packet as the RT group) (n=21). Assessments: nine scripted tapes (three cases at three levels of performance), rated eight months after training. Outcome: confidence intervals and range used to estimate IRR (more stringent ratings in the training group with a smaller range); regression showed significantly lower ratings in the trained group. Comments: high quality; positive effect; faculty had to be active in teaching to be nominated; no differences in baseline ratings.
Newble et al. 1980 [21] (Australia, single centre). Objective: value of RT for the reliability of scores and examiner selection for a clinical exam (standardized checklist with scores). Design: randomized controlled trial, rated pre- and post-training. Population: surgical and internist faculty (n=18; nine of each, selection method unclear). Intervention: limited 30-minute individual training, PDT (n=6); extensive two-hour group training, PDT and FOR with practice (n=6). Control: no training (rated two months later) (n=6). Assessments: five videos of real students with a standardized patient (students unaware the patient was standardized); rated a few days after RT. Outcome: IRR by Kendall's coefficient (pre- to post-training scores: 0.48 to 0.44 intensive, 0.57 to 0.63 limited, 0.71 to 0.70 control); Spearman correlation between groups (0.8-0.9). Comments: no effect of more intensive training on reliability; the intensive training group was the most inconsistent at baseline (no difference between internists and surgeons); the only measure that improved IRR was removing unreliable raters.
Van der Vleuten et al. 1989 [17] (US, single centre). Objective: does training increase the accuracy of assessments of clinical skills (history and physical examination standardized checklists)? Design: randomized trial (three groups: doctors, medical students, and lay people). Population: physicians (surgeons and family doctors), all prior examiners (physicians n=22). Intervention: 1.5 hours of FOR training with practice (n=11). Control: no training (n=11). Assessments: four tapes (two tapes of two cases) of real students with a standardized patient, rated immediately. Outcome: accuracy as % agreement with consensus scores (overall 82% for RT and 81% for control). Comments: only the attending cohort was considered for this review; all were previous examiners; some mild improvements on individual cases, but no significance levels or statistical comparisons given.
Ludbrook et al. 1971 [20] (Australia, single centre). Objective: does examiner training decrease inter-examiner variability in a clinical skills exam? Design: randomized trial; not stated how, or whether truly random; examiners marked in pairs (both trained or both untrained). Population: surgical faculty (n=16). Intervention: 2.5-hour FOR training with videos and practice (n=16). Control: no training (n=16). Assessments: a medical school class of 100 students (each student examined in two cases by one pair of examiners), marked one week post-RT. Outcome: correlation coefficients between marking pairs (within pairs r=0.55 RT and 0.49 control, p<0.01 for both; between pairs r=0.11 RT and 0.14 control, p>0.05 for both). Comments: no effect on correlation between rater groups; all had previous examiner experience; different scoring rubrics for the two groups.
Dudek et al. 2013 [19] (Canada, multicentre, four schools). Objective: does training improve the quality of ITERs for medicine residents (CCERR form used to assess ITER quality)? Design: randomized trial of five types of training with varying levels of feedback guidance. Population: physicians who supervise medical trainees (n=98; only 37 returned data at all required time points). Intervention: one CCERR score given (n=7), three scores given (n=6), one score plus feedback given (n=9), three scores plus feedback given (n=5). Control: no feedback (n=10). Assessments: participants submitted whatever ITERs they completed; scores with or without feedback were returned every six months for three cycles. Outcome: mean CCERR scores improved in the feedback groups, but not significantly. Comments: outcome was report quality; only feedback and scores were given, with no true "rater training" workshop; low complete data collection (37/98); significantly underpowered (power calculation required n=240).
Dudek et al. 2012 [18] (Canada, multicentre, three sites). Objective: effectiveness of a workshop at improving ITERs. Design: uncontrolled pre-/post-training design. Population: physicians who supervise trainees and complete ITERs; voluntary (n=22). Intervention: three-hour workshop (explaining what makes a good FITER, recognizing challenges). Control: none (pre-/post-workshop comparison). Assessments: ITERs of real clinical encounters pre- and post-training. Outcome: mean CCERR scores (18.9 pre-training vs. 21.7 post-training, p=0.02); ANOVA showed no time interaction with CCERR items, so changes were consistent pre/post across all items. Comments: outcome was report quality.
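Most studies above report inter-rater reliability as Cronbach's alpha or an intraclass correlation coefficient computed over a subjects-by-raters score matrix. As a minimal illustrative sketch (not taken from any of the included studies; the ratings matrix and group sizes below are invented), these two statistics can be computed as follows:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha treating raters as items.
    scores: (n_subjects, n_raters) array of ratings."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of each subject's summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # between-subjects
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # between-raters
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 5 trainee videos, each scored by 3 raters on a 5-point GRS
ratings = [[3, 4, 3],
           [2, 2, 3],
           [4, 5, 4],
           [1, 2, 1],
           [5, 5, 4]]
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
print(f"ICC(2,1):         {icc_2_1(ratings):.2f}")
```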