Abstract
Background
A key component of competency‐based medical education (CBME) is direct observation of trainees. Direct observation has been emphasized as integral to workplace‐based assessment (WBA), yet previously identified challenges may limit its successful implementation. Given these challenges, it is imperative to fully understand the value of direct observation within a CBME program of assessment. Specifically, it is not known whether the quality of WBA documentation is influenced by observation type (direct or indirect).
Methods
The objective of this study was to determine the influence of observation type (direct or indirect) on quality of entrustable professional activity (EPA) assessment documentation within a CBME program. EPA assessments were scored by four raters using the Quality of Assessment for Learning (QuAL) instrument, a previously published three‐item quantitative measure of the quality of written comments associated with a single clinical performance score. An analysis of variance was performed to compare mean QuAL scores among the direct and indirect observation groups. The reliability of the QuAL instrument for EPA assessments was calculated using a generalizability analysis.
Results
A total of 244 EPA assessments (122 direct observation, 122 indirect observation) were rated for quality using the QuAL instrument. No difference in mean QuAL score was identified between the direct and indirect observation groups (p = 0.17). The reliability of the QuAL instrument for EPA assessments was 0.84.
Conclusions
Observation type (direct or indirect) did not influence the quality of EPA assessment documentation. This finding raises the question of how direct and indirect observation truly differ and the implications for meta‐raters such as competence committees responsible for making judgments related to trainee promotion.
INTRODUCTION
Over the past 20 years, there has been a global movement within postgraduate medical training toward competency‐based medical education (CBME). 1 A key component of CBME is workplace‐based assessment (WBA), which includes observation of trainees. 1 , 2 Observation during WBA has traditionally been categorized into two types: direct and indirect. 3 Direct observation has been broadly defined as the active process of watching trainees perform to develop an understanding of how they apply their knowledge and skills to clinical practice. 4 Conversely, indirect observation comprises observations that occur when the supervisor has not directly watched the trainee perform the task being assessed. 3 , 5 Following indirect observation, assessment of trainee performance is based on inferences and surrogate data. 5 , 6
Observation of trainees has been emphasized within CBME as it allows for authentic judgments of competence in the workplace. 7 Direct observation has been shown to provide reliable and valid assessment data across a range of competencies. 8 Further, direct observation has been found to foster supervisor–trainee trust and promote effective feedback. 9 , 10 Despite these purported benefits, adoption of direct observation during clinical supervision has remained challenging in CBME due to a number of barriers, including trainee concerns that they are burdening their supervisors and supervisor fears of decreasing trainee autonomy. 11 , 12 A common barrier identified by both supervisors and trainees is the time required to perform direct observation and its impact on the efficiency of clinical care. 11 , 13 , 14 , 15 A tension exists for both trainees and supervisors as they balance direct observation requirements with the provision of effective and safe patient care. Given the challenges associated with direct observation, it is imperative that medical educators understand the value of direct observation within a CBME program of assessment.
The adoption and implementation of CBME within Canada has led to widespread changes across specialty training programs including the sequencing of trainee progression into stages. 16 Each stage is associated with a unique set of specialty‐specific entrustable professional activities (EPAs). EPAs represent the essential tasks or responsibilities of a specialty that an individual can be trusted to perform in a given health care context, once sufficient competence has been demonstrated. 17 Documented EPA assessments contain narrative comments associated with a single clinical performance score anchored on an entrustment scale. Within Canadian CBME programs, EPA assessments are the primary form of observed and documented WBA. 3 Through EPA assessments, trainees demonstrate competence in each stage of training before promotion within their training program. 18
Observations that occur in the workplace must be both rated and documented for review by meta‐raters such as competence committees. 19 , 20 Competence committees are ultimately responsible for the aggregation and review of trainee assessment data, which informs high‐stakes judgments relating to trainee promotion and readiness for independent practice. 20 , 21 Failure to provide competence committees with high‐quality trainee assessment data places the committee at risk of a “garbage in, garbage out” phenomenon, in which decisions made by meta‐raters on the committee are informed by poor‐quality data.
Given that direct observation has been found to foster supervisor–trainee trust and promote effective feedback, it is possible that this form of observation leads to higher quality trainee assessment data. 9 , 10 At present, it is not known whether the quality of EPA assessments is influenced by observation type (direct vs. indirect observation). Therefore, the objective of this study was to determine the influence of observation type (direct or indirect) on quality of EPA assessments within a CBME program.
METHODS
Study setting and WBA tool
This study was conducted in the department of emergency medicine (DEM) at The Ottawa Hospital. The DEM is a large, academic, tertiary care emergency department (ED) with over 170,000 patient visits per year across two hospital campuses. All DEM faculty (n = 85) were eligible to provide clinical supervision to trainees and complete EPA assessments. During the Canadian implementation of CBME across emergency medicine training programs in 2018, all DEM faculty received an in‐person training session on how to observe and document an EPA assessment. DEM faculty did not receive any additional training on assessment as part of this study.
The DEM supports an emergency medicine training program of 50 trainees with both CBME and non‐CBME trainee cohorts. At the time of this study, there were 19 CBME trainees. While rotating in the ED, trainees are scheduled for 14–16 shifts per month with different faculty supervisors. Each 6‐ to 8‐h shift provides trainees with exposure to a full range of patient acuity levels from acute resuscitation to ambulatory care.
For CBME trainees, EPA assessments are the primary method of WBA. A national set of emergency medicine EPA assessments has been developed and is grouped into four stages of training: transition to discipline, foundations of discipline, core of discipline, and transition to practice (Data S1). 16 Each EPA assessment contains a narrative comment section associated with a single clinical performance score anchored on an entrustment scale (Data S2). 22 Both trainees and supervisors have an important role in the initiation and documentation of EPA assessments in the workplace (Figure 1). EPA assessments are documented and stored electronically using an online database.
FIGURE 1.

Process of EPA selection, observation, and documentation. EPA, entrustable professional activity
Study population
The study population consisted of all emergency medicine CBME trainees who collected EPA assessments (n = 17) during the study period (December 15, 2019–April 30, 2020). Two CBME trainees were not on rotation in the ED during the study period and did not collect EPAs. CBME trainees were in either the foundations or the core stages of training. Non‐CBME trainees and medical students were excluded from this study as they do not use the same set of EPA assessments.
Beginning December 1, 2019, the DEM instituted a policy whereby it became mandatory for emergency medicine CBME trainees to indicate whether EPA assessments were documented following either direct or indirect observation by their supervisor. Trainees were instructed to document the type of observation (direct or indirect) within the EPA assessment written comment field. To facilitate this policy change, trainees received definitions for direct and indirect observation including examples for reference (Data S3). This policy was disseminated in‐person and through institutional emails.
Outcome measure
The Quality of Assessment for Learning (QuAL) instrument was used to measure the quality of EPA assessments. 23 The QuAL is a short, three‐item instrument designed to evaluate the quality of written comments associated with a single clinical performance score (Table 1). 23 Use of the instrument does not require raters to undertake any training other than reading the instructions. QuAL scores range from 0 (lowest possible quality) to 5 (highest possible quality). The QuAL has previously been shown to measure the quality of short written comments associated with a single clinical performance score with a high degree of reliability. 23 Further, the QuAL correlates closely with rater perceptions of utility, providing validity evidence. 23
TABLE 1.
QuAL instrument
| Item | Score | Descriptor |
|---|---|---|
| Evidence: Does the rater provide sufficient evidence about trainee performance? | 0 | No comment at all |
|  | 1 | No, but comment present |
|  | 2 | Somewhat |
|  | 3 | Yes, full description |
| Suggestion: Does the rater provide a suggestion for improvement? | 0 | No |
|  | 1 | Yes |
| Connection: Is the rater's suggestion linked to the behavior described? | 0 | No |
|  | 1 | Yes |
Note: Adapted from Chan et al., 2020. 23
Abbreviation: QuAL, Quality of Assessment for Learning.
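As an illustration only, and not part of the published instrument, the following minimal Python sketch shows how the three QuAL items combine into a single 0–5 score. The class and field names are hypothetical, and the rule that the Connection item scores 0 when no suggestion is offered is our assumption rather than a published scoring rule.

```python
from dataclasses import dataclass

@dataclass
class QualRating:
    """One rater's QuAL scores for a single EPA assessment (hypothetical structure)."""
    evidence: int    # 0-3: sufficiency of evidence about trainee performance
    suggestion: int  # 0-1: is a suggestion for improvement present?
    connection: int  # 0-1: is the suggestion linked to the behavior described?

    def total(self) -> int:
        """Total QuAL score, from 0 (lowest) to 5 (highest)."""
        if not 0 <= self.evidence <= 3:
            raise ValueError("evidence must be 0-3")
        if self.suggestion not in (0, 1) or self.connection not in (0, 1):
            raise ValueError("suggestion and connection must be 0 or 1")
        # Assumption: if no suggestion is offered, the connection item cannot score.
        connection = self.connection if self.suggestion == 1 else 0
        return self.evidence + self.suggestion + connection

# Example: full description of performance plus a suggestion linked to that behavior -> 5
print(QualRating(evidence=3, suggestion=1, connection=1).total())
```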
Procedures
Figure 2 outlines the process through which the EPA assessments used in the rating exercise were selected. All EPA assessment forms completed during the study period were identified using the DEM's online database and manually reviewed by one author (JML) to determine whether the type of observation (direct or indirect) was documented on the form. This subgroup of EPA assessments was then categorized by type of observation (direct or indirect), type of EPA, trainee, and supervisor. EPAs that were predominantly (>80%) directly or indirectly observed were excluded from the rating exercise, as these EPAs were felt to contain intrinsic design elements that promoted a particular type of observation. 24 The remaining EPA assessments were further categorized according to the completing supervisor. When a supervisor had an unequal number of EPA assessments in the direct and indirect observation groups, the assessments included were randomly selected using computer‐generated sampling (Microsoft Excel version 16.16.20).
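One plausible reading of this balancing step is sketched below in Python (the authors used Excel; the record structure and field names here are hypothetical): for each supervisor, the larger observation group is randomly down‐sampled to match the size of the smaller one.

```python
import random
from collections import defaultdict

# Hypothetical records: one dict per eligible EPA assessment.
assessments = [
    {"id": 1, "supervisor": "A", "observation": "direct"},
    {"id": 2, "supervisor": "A", "observation": "direct"},
    {"id": 3, "supervisor": "A", "observation": "indirect"},
    {"id": 4, "supervisor": "B", "observation": "indirect"},
    {"id": 5, "supervisor": "B", "observation": "direct"},
]

by_supervisor = defaultdict(lambda: {"direct": [], "indirect": []})
for a in assessments:
    by_supervisor[a["supervisor"]][a["observation"]].append(a)

balanced = []
for groups in by_supervisor.values():
    n = min(len(groups["direct"]), len(groups["indirect"]))
    # Randomly down-sample so each supervisor contributes equal numbers of
    # directly and indirectly observed assessments.
    balanced += random.sample(groups["direct"], n) + random.sample(groups["indirect"], n)

print(len(balanced), "assessments retained for the rating exercise")
```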
FIGURE 2.

Selection of EPA assessments for final rating exercise. EPA, entrustable professional activity
Based on prior studies, we estimated that 25 EPA assessments in each observation group would be needed to detect a significant difference with an effect size of 0.80, assuming a significance level (α) of 0.05, power of 0.80, and a standard deviation of 1.09. 23 EPA assessments were deidentified with regard to type of observation, trainee, and supervisor. A unique study number was assigned to identify each EPA assessment. Four blinded physician raters independently scored the EPA assessments using the QuAL instrument.
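The sample size estimate can be approximately reproduced with a standard two‐group power calculation. The sketch below (Python with statsmodels) is an illustration under the stated inputs, not the authors' actual calculation.

```python
from statsmodels.stats.power import TTestIndPower

# Power analysis for a two-group comparison (direct vs. indirect observation).
# Inputs taken from the text: effect size (Cohen's d) = 0.80, alpha = 0.05, power = 0.80.
n_per_group = TTestIndPower().solve_power(
    effect_size=0.80, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"{n_per_group:.1f} assessments per group")  # ~25.5, in line with the reported estimate of 25
```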
Quality of EPA assessment by type of observation
Data analysis was conducted using G‐String IV with Genova Version 6.1.1 and IBM SPSS Statistics for Windows Version 23. Mean QuAL scores in the direct and indirect observation groups were compared using a factorial analysis of variance (ANOVA), with mean QuAL score as the dependent variable and type of observation as the independent variable.
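A minimal sketch of this comparison is shown below in Python with statsmodels rather than SPSS; the data frame layout and values are hypothetical, and the model shown is a simplified one‐factor version of the comparison rather than the authors' full factorial specification.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical layout: one row per EPA assessment with its mean QuAL score
# (averaged across the four raters) and the documented observation type.
df = pd.DataFrame({
    "qual_score": [3.5, 4.0, 4.25, 3.0, 3.75, 4.5],
    "observation": ["direct", "direct", "direct", "indirect", "indirect", "indirect"],
})

# ANOVA: mean QuAL score as the dependent variable, observation type as the independent variable.
model = ols("qual_score ~ C(observation)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```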
Reliability
The reliability of the QuAL score for EPA assessments was calculated using a generalizability analysis. In this model, individual EPA assessments were considered the object of measurement, with condition (observation type) and trainee (1–17) treated as between‐subject factors. EPA assessments were nested within trainee × condition (observation type) and crossed with physician rater (1–4). Mean QuAL score was the dependent measure. The variance components that resulted from these analyses were used to determine the reliability of the QuAL instrument.
A decision study was performed to determine the number of raters needed to produce a target reliability of 0.80. Variance components from the generalizability analysis were used to capture the reliability of the QuAL for EPA assessments with different numbers of physician raters. All research procedures were approved by the Ottawa Health Science Network Research Ethics Board.
RESULTS
A total of 1070 EPA assessments were collected by 17 trainees during the study period. From the data set, 244 EPA assessments (122 direct observation, 122 indirect observation) were selected for further analysis and scored independently by four raters using the QuAL instrument (Figure 2). EPA assessments were completed by 55 different supervisors for 17 trainees. The mean number of EPA assessments completed by supervisors was 4.4 (range 2–22). The mean number of EPA assessments completed for each trainee was 14.3 (range 2–32).
Quality of EPA assessment by type of observation
The mean (±SD) QuAL scores in the direct and indirect observation groups were 3.78 (±0.95) and 3.63 (±0.96), respectively. No significant difference in mean QuAL score was found between observation types (direct vs. indirect; F(1, 210) = 1.89; p = 0.17; Table 2).
TABLE 2.
Mean QuAL scores from final study pool rating exercise
| Physician rater | Direct: QuAL score, mean (±SD) | Indirect: QuAL score, mean (±SD) |
|---|---|---|
| Physician rater 1 | 4.03 (±1.05) | 4.05 (±1.08) |
| Physician rater 2 | 3.98 (±0.97) | 3.92 (±1.07) |
| Physician rater 3 | 3.77 (±1.04) | 3.49 (±1.07) |
| Physician rater 4 | 3.35 (±1.35) | 3.06 (±1.42) |
| Physician raters (combined) | 3.78 (±0.95) | 3.63 (±0.96) |
Reliability
The variance components derived from the generalizability analysis support the finding that there was no difference in QuAL score between observation types (Table 3). The largest proportion of variance in QuAL scores (44%) was attributed to the quality of the individual EPA assessments. Facets involving raters (r, cr, tr, ctr) accounted for relatively little of the variability, which suggests that the four raters scored the EPA assessments in a similar manner. The reliability of the QuAL for EPA assessments was 0.72 with two raters, 0.79 with three raters, and 0.84 with four raters.
TABLE 3.
Variance components of generalizability study
| Facet | VC | Variance (%) | Explanation |
|---|---|---|---|
| c | 0.00 | 0 | Variance due to differences between conditions (observation type) |
| t | 0.16 | 11 | Variance due to differences between trainees |
| e:tc | 0.64 | 44 | Variance due to differences in the EPA assessment as a function of trainee and condition (observation type) |
| r | 0.14 | 9 | Variance due to differences between raters |
| tc | 0.00 | 0 | Variance due to trainee having different QuAL scores depending on the condition (observation type) |
| cr | 0.00 | 0 | Variance due to raters assigning different QuAL scores depending on the condition (observation type) |
| tr | 0.00 | 0 | Variance due to raters assigning different QuAL scores to each trainee |
| ctr | 0.02 | 1 | Variance due to raters assigning different QuAL scores to trainees by condition (observation type) |
| er:tc | 0.49 | 34 | Overall error |
Abbreviations: c, condition (observation type); e, EPA (entrustable professional activity); r, rater; t, trainee; VC, variance component.
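The decision‐study reliabilities reported above can be approximately reconstructed from the variance components in Table 3 under one plausible error model: treating the EPA assessment variance (e:tc) as the universe‐score variance and dividing the residual term (er:tc) by the number of raters. The Python sketch below is an illustration of that calculation, not the authors' G_String/GENOVA output.

```python
# Variance components from Table 3.
var_epa = 0.64        # e:tc - EPA assessments (object of measurement)
var_residual = 0.49   # er:tc - residual error involving raters

# Decision study: reliability (G coefficient) as a function of rater count,
# assuming the rater-related error averages out as raters are added.
for n_raters in (1, 2, 3, 4):
    g = var_epa / (var_epa + var_residual / n_raters)
    print(f"{n_raters} rater(s): G = {g:.2f}")

# With 2, 3, and 4 raters this yields roughly 0.72, 0.80, and 0.84, close to the
# reported 0.72, 0.79, and 0.84; small differences reflect rounding of the variance components.
```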
DISCUSSION
The global adoption and implementation of CBME has placed increased emphasis on WBA and observation of trainees. 7 These observations form the foundation from which feedback and assessment are constructed and delivered to trainees. 25 As part of a robust program of assessment, documented observations are reviewed by competence committees who are ultimately responsible for the integration of a trainee's assessment data and decisions relating to trainee promotion and readiness for independent practice. 21 , 26
Despite calls for increased direct observation of trainees, many challenges and barriers have been described such as trainee apprehension around initiation of direct observation and the increased time and resources required. 11 , 15 These barriers have the potential to limit the implementation and daily use of direct observation within the clinical environment. Given these challenges, it is essential that medical educators fully understand what value direct observation adds during assessment of trainees in the workplace. In this study, direct observation of trainees did not improve the quality of EPA assessment documentation.
While EPA assessment documentation quality did not differ following direct and indirect observation, other differences between observation types may exist. For example, the narrative content within EPA assessments may differ, with each type of observation highlighting a different aspect of performance for a particular EPA. Future work may consider narrative content analysis to determine whether these differences exist.
Trainee perceptions of the value of direct and indirect observation may also differ. Trainees make judgments regarding the credibility of feedback they receive on their clinical performance. 25 , 27 It is possible that trainees perceive one observation type to produce more credible feedback than the other. Future work investigating how trainees perceive feedback following direct and indirect observation may offer a more complete understanding of the value of each observation type within CBME.
Further, trainees may receive verbal feedback of differing quality following direct and indirect observation of EPA assessments. Learning conversations that include verbal feedback are essential for trainee growth and development during medical training and are an important component of formative assessment in CBME. 25 , 28 In practice, supervisors and trainees often engage in conversation around the time an observation is made, during which feedback is delivered. It remains unclear how these conversations are translated into EPA assessment documentation. Future work is necessary to determine whether the nondocumented conversations that occur following each type of observation differ.
In a CBME program, it is the competence committee that is ultimately responsible for making decisions relating to trainee promotion and readiness for independent practice. 21 , 26 This study suggests that competence committees receive EPA assessment documentation of the same quality irrespective of the type of observation. However, failure to document the verbal feedback provided during learning conversations leaves a competence committee blinded to the true abilities of the trainee and limited in its capacity to make informed decisions related to trainee progression.
QuAL instrument reliability
From our generalizability analysis, the reliability of the QuAL instrument for EPA assessments was 0.84 with four raters. The subsequent decision study showed that a reliability of 0.79 can be achieved with three raters. In a previous study by Chan et al., 23 the QuAL instrument required only two raters to reach a target reliability of >0.80 when rating the quality of written comments from a non‐EPA WBA database. Our study provides further evidence for the reliability of the QuAL instrument. Given our findings, future research using the QuAL instrument to rate the quality of EPA assessment documentation should employ four raters to target a reliability of >0.80. Given its brevity and ease of use, the QuAL instrument has many potential future applications, including faculty development with the goal of improving the quality of EPA assessments.
LIMITATIONS
This study was conducted at a single institution within a single training program; thus, generalizability to other programs should be considered carefully. Due to the transition of Canadian emergency medicine programs to CBME in 2018, this study involved two cohorts of CBME trainees who were in the foundations and core stages of training. During the study period, no trainees in the program had been promoted to the final stage of training. It is possible that EPA assessment documentation quality between observation types may differ in the last stage of training, as trainees are provided with increased autonomy. Future work will seek to explore the quality of EPA assessments more broadly within all stages of training and across multiple institutions.
Of the completed EPA assessments, 272 (25.4%) did not have the type of observation recorded. Limitations of the electronic EPA assessment forms required trainees to manually document what type of observation occurred within the narrative comment field. In cases where the type of observation was missing, it is believed that trainees simply forgot to record the observation type. Purposeful omission is unlikely as trainees were not aware of the study purpose and had no incentive to omit recording the observation type.
Finally, this study asked trainees to interpret and record the type of observation that occurred. It is possible that trainees interpreted what constitutes direct and indirect observation variably. An attempt to mitigate this limitation was made by providing trainees with clear definitions and examples of each type of observation (Data S3). For observations that may have included components of both direct and indirect observation, trainees were asked to record the type of observation that applied to the majority of the EPA assessment.
CONCLUSIONS
An emerging tension within competency‐based medical education relates to the challenges associated with direct observation of trainees in the workplace. 13 , 14 , 25 Recognizing this tension, it is essential that medical educators ask themselves where the value of direct observation within a competency‐based medical education program resides. In this study, direct observation of trainees did not improve the quality of entrustable professional activity assessment documentation. Our findings raise questions about the implications for meta‐raters such as competence committees responsible for making judgments related to trainee promotion.
Supporting information
Data S1
Data S2
Data S3
Landreville JM, Wood TJ, Frank JR, Cheung WJ. Does direct observation influence the quality of workplace‐based assessment documentation? AEM Educ Train. 2022;00:e10781. doi: 10.1002/aet2.10781
Supervising Editor: Dr. Daniel Egan
REFERENCES
1. Ten Cate O. Competency‐based postgraduate medical education: past, present and future. GMS J Med Educ. 2017;34(5):1‐13.
2. Harris P, Bhanji F, Topps M, et al. Evolving concepts of assessment in a competency‐based world. Med Teach. 2017;39(6):603‐608.
3. Gofton W, Dudek N, Barton G, Bhanji F. Workplace‐Based Assessment Implementation Guide: Formative Tips for Medical Teaching Practice. 1st ed. The Royal College of Physicians and Surgeons of Canada; 2017:1‐12.
4. LaDonna K, Hatala R, Lingard L, Voyer S, Watling C. Staging a performance: learners' perceptions about direct observation during residency. Med Educ. 2017;51(5):498‐510.
5. Melvin L, Cavalcanti R. The oral case presentation: a key tool for assessment and teaching in competency‐based medical education. JAMA. 2016;316(21):2187‐2188.
6. Landreville J, Cheung W, Hamelin A, Frank J. Entrustment checkpoint: clinical supervisors' perceptions of the emergency department oral case presentation. Teach Learn Med. 2019;31(3):250‐257.
7. Iobst W, Sherbino J, Ten Cate O, et al. Competency‐based medical education in postgraduate medical education. Med Teach. 2010;32(8):651‐656.
8. Kogan J, Hatala R, Hauer K, Holmboe E. Guidelines: the do's, don'ts and don't knows of direct observation of clinical skills in medical education. Perspect Med Educ. 2017;6(5):286‐305.
9. Norcini J, Burch V. Workplace‐based assessment as an educational tool: AMEE Guide No. 31. Med Teach. 2007;29(9–10):855‐871.
10. Rietmeijer C, Huisman D, Blankenstein A, et al. Patterns of direct observation and their impact during residency: general practice supervisors' views. Med Educ. 2018;52(9):981‐991.
11. Cheung W, Patey A, Frank J, Mackay M, Boet S. Barriers and enablers to direct observation of trainees' clinical performance: a qualitative study using the theoretical domains framework. Acad Med. 2019;94(1):101‐114.
12. Watling C, LaDonna K, Lingard L, Voyer S, Hatala R. 'Sometimes the work just needs to be done': socio‐cultural influences on direct observation in medical training. Med Educ. 2016;50(10):1054‐1064.
13. Holmboe E. Realizing the promise of competency‐based medical education. Acad Med. 2015;90(4):411‐413.
14. Holmboe E. Faculty and the observation of trainees' clinical skills: problems and opportunities. Acad Med. 2004;79(1):16‐22.
15. Madan R, Conn D, Dubo E, Voore P, Wiesenfeld L. The enablers and barriers to the use of direct observation of trainee clinical skills by supervising faculty in a psychiatry residency program. Can J Psychiatry. 2012;57(4):269‐272.
16. Sherbino J, Bandiera G, Doyle K, et al. The competency‐based medical education evolution of Canadian emergency medicine specialist training. CJEM. 2020;22(1):95‐102.
17. Ten Cate O. Nuts and bolts of entrustable professional activities. J Grad Med Educ. 2013;5(1):157‐158.
18. Thoma B, Hall A, Clark K, et al. Evaluation of a national competency‐based assessment system in emergency medicine: a CanDREAM study. J Grad Med Educ. 2020;12(4):425‐434.
19. Chan T, Sherbino J, Mercuri M. Nuance and noise: lessons learned from longitudinal aggregated assessment data. J Grad Med Educ. 2017;9(6):724‐729.
20. Lockyer J, Carraccio C, Chan M, et al. Core principles of assessment in competency‐based medical education. Med Teach. 2017;39(6):609‐616.
21. Hauer K, Chesluk B, Iobst W, et al. Reviewing residents' competence: a qualitative study of the role of clinical competency committees in performance assessment. Acad Med. 2015;90(8):1084‐1092.
22. Rekman J, Gofton W, Dudek N, Gofton T, Hamstra S. Entrustability scales: outlining their usefulness for competency‐based clinical assessment. Acad Med. 2016;91(2):186‐190.
23. Chan T, Sebok‐Syer S, Sampson C, Monteiro S. The Quality of Assessment of Learning (QuAL) score: validity evidence for a scoring system aimed at rating short, workplace‐based comments on trainee performance. Teach Learn Med. 2020;32(3):319‐329.
24. Landreville J, Frank J, Cheung W. Does direct observation happen early in a new competency‐based residency program? AEM Educ Train. 2021;5(2):e10591.
25. Watling C, Ginsburg S. Assessment, feedback and the alchemy of learning. Med Educ. 2019;53(1):76‐85.
26. Doty C, Roppolo L, Asher S, et al. How do emergency medicine residency programs structure their clinical competency committees? A survey. Acad Emerg Med. 2015;22(11):1351‐1354.
27. Watling C, Driessen E, van der Vleuten C, Lingard L. Learning from clinical work: the roles of learning cues and credibility judgements. Med Educ. 2012;46(2):192‐200.
28. Tavares W, Eppich W, Cheng A, et al. Learning conversations: an analysis of the theoretical roots and their manifestations of feedback and debriefing in medical education. Acad Med. 2020;95(7):1020‐1025.