Abstract
Objectives
A lifelong pursuit such as medicine is frequently paired with a framework of “deliberate practice” for improvement. It is unclear whether the quality of feedback varies across different learner levels. Our study aims to assess whether a difference exists in the quality of feedback delivered to high‐, expected‐, and below‐expected performer emergency medicine (EM) residents based on their attending‐identified performance level.
Methods
We conducted a retrospective review of written EM resident feedback collected between November 2018 and March 2021. Clinical performance level was subjectively determined by attending faculty in their feedback. Feedback was coded on a scale from 0–5 based on the presence (1) or absence (0) of items modified from Ende's SMART criteria: Specific (S), Measurable (M), Achievable (A), Relevant (R), and Time‐bound (T). The primary outcome was any difference in total modified SMART criteria score by performance level, assessed using logistic regression with Generalized Estimating Equations (GEE). Secondary outcomes were differences for individual criteria.
Results
We analyzed 1284 evaluations (311 high performers, 930 expected performers, and 43 below‐expected performers) of 94 unique residents from 66 different evaluators. Mean total modified SMART scores were significantly higher in high and below‐expected performers than those designated as expected performers by faculty evaluators. Achievable and Relevant written feedback was provided to high performers in a significantly larger proportion than expected and below‐expected performers. Only 278 out of 1284 evaluations met criteria for Specific.
Conclusions
Mean total modified SMART feedback scores were significantly greater in high performers and below‐expected performers when compared to expected performers. Achievable and Relevant feedback was provided in greater proportions to high performer residents compared to expected and below‐expected performers. These findings are a challenge to academic faculty to engage in quality feedback delivery for residents at all performance levels.
INTRODUCTION
Much has been written about feedback dynamics, and the graduate medical education community has placed a great deal of emphasis on the remediation of below‐expected clinical performance of residents. 1–6 Comparatively little attention has been paid to the evaluation of expected‐ and high‐performing emergency medicine (EM) residents. 7–9 The conceptual framework of “deliberate practice,” wherein learners of all skill levels can improve their clinical performance by incorporating immediate feedback, is the core idea framing our research question. 10–14 While there are data on the “halo effect” in assessment, there is a need to understand better whether and how this finding translates into feedback. 15, 16 Our literature review found no prior work designed to evaluate the quality of feedback based on the performance ranking of a graduate medical learner in EM. This work is the first to query qualitative differences in written feedback for residents with respect to attending‐identified clinical performance level. Previous work by this study group showed weak‐to‐moderate positive correlations between ACGME milestone ratings and percentages of attending‐designated “high performer” ratings. 17 Based on the emphasis on below‐expected performers in the existing literature and on the authors' collective anecdotal experience, we hypothesized that written feedback comments for below‐expected performer residents would have higher feedback quality scores than those for residents identified as performing at expected or high performer levels.
METHODS
Study design
This retrospective study compared written feedback content for EM residents grouped by their attending‐identified performance level. Participants were de‐identified and were not aware of a study taking place a priori. The study was deemed exempt by the (blinded) institutional review board. No funding or sponsorship supported this study.
Study setting and population
Written feedback was collected during a 28‐month consecutive convenience sampling period (November 2018–March 2021) at a four‐year EM residency at a quaternary care hospital. The institution has 59 total residents (14–17 residents per class). There were no exclusion criteria, and all residents had at least one evaluation.
Study protocol
A mobile‐friendly Qualtrics™ (Qualtrics, Provo, UT) tool has been used to collect post‐shift feedback at the study site since 2017 and was updated to include a question about performance level at the beginning of the study period. 18 EM residents were subjectively designated by faculty evaluators as belonging to one of three categories: high, expected, or below‐expected performer. Evaluators were not provided with specific criteria for these categories. Faculty are expected to provide feedback in person, initiated by either party, with the mobile application open. The written feedback can also be entered at a later time and is regularly reviewed by advisors, program leadership, and the clinical competency committee (CCC). The resident and their advisors have on‐demand access as well.
The two coders (RC, AT) consisted of one male Assistant Program Director and one female Associate Program Director (APD). Written feedback was coded based on the presence (1) or absence (0) of five modified Ende's SMART criteria (Table 1): Specific (S), Measurable (M), Achievable (A), Relevant (R), and Time‐bound (T). 19 Coders deliberately considered their perceptions and biases in an exercise of reflexivity. Fifteen percent of the data were coded independently by the two coders and then again jointly via discussion as training. The coders then reviewed the training coding results with the full study group to iteratively refine and resolve any discrepancies and as further training before continuing. An additional 10% of overlapping data were then coded independently by the two authors. After this training period, the two coders independently coded the remainder of data. Each evaluation from an attending was coded on the presence (1) or absence (0) of the SMART criteria such that every entry received five separate scores. If one entry had multiple pieces of feedback, they were coded together. If the same attending entered a subsequent evaluation of the same resident on a different occasion, that was scored separately.
TABLE 1.
Modified SMART score coding dictionary
| Criterion | Coding rule |
|---|---|
| Specific (S) | Includes details of setting OR not generalizable OR detailed advice of specific scenario, or type of patient, or diagnosis OR residents would be able to pinpoint which case that may have been in reference to |
| Measurable (M) | Must be something that can change between this evaluation and the next AND a binary item, e.g., did or did not complete an item OR can be literally and practically quantitatively assessed numerically |
| Achievable (A) | Needs to be something remediable (i.e., feedback regarding changing one's personality is not Achievable) |
| Relevant (R) | Something that should be expected of an emergency medicine resident at their level of training |
| Time‐bound (T) | Guidance must have a deadline or timeline explicitly mentioned |
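The coding scheme can be illustrated with a minimal sketch, assuming each evaluation is stored as five binary codes whose sum is the total modified SMART score. The class name, field names, and example values below are hypothetical and are not taken from the study's data or tooling.

```python
from dataclasses import dataclass

# Field names are illustrative; the study only specifies the five criteria.
CRITERIA = ("specific", "measurable", "achievable", "relevant", "time_bound")


@dataclass(frozen=True)
class CodedEvaluation:
    """One written evaluation, coded 1 (present) or 0 (absent) per criterion."""
    specific: int
    measurable: int
    achievable: int
    relevant: int
    time_bound: int

    def total(self) -> int:
        """Total modified SMART score: the sum of the five binary codes (0 to 5)."""
        codes = [getattr(self, name) for name in CRITERIA]
        if any(code not in (0, 1) for code in codes):
            raise ValueError("each criterion must be coded 0 or 1")
        return sum(codes)


# Hypothetical entry that is Measurable, Achievable, and Relevant only.
example = CodedEvaluation(specific=0, measurable=1, achievable=1, relevant=1, time_bound=0)
# A blank evaluation receives 0 on every criterion.
blank = CodedEvaluation(0, 0, 0, 0, 0)

print(example.total())  # 3
print(blank.total())    # 0
```

Keeping the five codes separate, rather than storing only the total, allows both the primary outcome (total score) and the secondary outcomes (individual criteria) to be derived from the same record.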
Measurements
The primary outcome was the difference in total modified Ende's SMART scores concerning attending‐identified performance levels. 19 Secondary outcomes were based on differences in individual criteria.
The data consist of multiple observations of the same residents and multiple observations from the same evaluators. It is quite likely that scores from the same resident would tend to be more similar than scores from other residents and, likewise, scores from the same evaluator tend to be more alike than scores from different evaluators. As traditional methods of analysis assume independence, Generalized Estimating Equations (GEE) were used to correct for correlation from this clustering that could confound differences observed between performance levels. As the correlation of modified SMART scores was much higher between repeated observations from evaluators compared to repeated observations of residents, the GEE analysis clustered on evaluators. In the GEE, a negative binomial distribution was used for the total SMART score while a binomial distribution was used for individual components.
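For readers less familiar with GEE, the clustering correction described above can be written in its standard textbook form. This is a generic sketch, not the authors' exact model specification: the notation, the choice of expected performers as the reference group, the link function g, and the working correlation structure R(α) are assumptions, since the text does not report them.

```latex
% Marginal mean model for evaluation j written by evaluator i.
% Y_{ij} is either the total SMART score or a single binary criterion;
% expected performers are assumed to be the reference category.
g(\mu_{ij}) = \beta_0 + \beta_1\,\mathrm{High}_{ij} + \beta_2\,\mathrm{Below}_{ij},
\qquad \mu_{ij} = \mathbb{E}[Y_{ij}]

% Estimating equations, summed over the K evaluator clusters.
\sum_{i=1}^{K} D_i^{\top} V_i^{-1}\bigl(Y_i - \mu_i(\beta)\bigr) = 0,
\qquad D_i = \frac{\partial \mu_i}{\partial \beta},
\qquad V_i = \phi\, A_i^{1/2}\, R(\alpha)\, A_i^{1/2}
```

Here K would be the 66 evaluators, A_i is the diagonal matrix of variance functions implied by the chosen distribution (negative binomial for the total score, binomial for each criterion), and R(α) captures how alike evaluations from the same evaluator are assumed to be.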
Data analysis
A total of 2001 evaluations were available during the study period, including 43 below‐expected, 1647 expected, and 311 high performers. To represent all demographic groups in the setting of a predominantly white group of residents and faculty, all entries from non‐white evaluators as well as from non‐white high or below‐expected performers were included given their low frequency relative to white counterparts. A random sample of expected performers from white evaluators was taken to complete the sample. Because expected‐performer evaluations far outnumbered those in the other categories, a total of 930 expected performers was sampled in order to provide an approximately 3:1 ratio between expected and high performers.
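The sampling step can be sketched as follows, assuming it amounts to keeping one group of expected‐performer evaluations in full and filling the remainder with a simple random draw. The function name, evaluation IDs, and seed are hypothetical. The split of 463 always‐included evaluations (the Expected column of Table 2 for evaluators not recorded as White: 194 + 33 + 178 + 58) and a pool of 1184 (1647 − 463) is our own derivation from the reported counts, not a figure stated by the authors.

```python
import random


def build_expected_sample(always_included, sampling_pool, target_total, seed=2021):
    """Keep every always-included evaluation, then randomly draw from the pool
    until the sample reaches target_total evaluations."""
    needed = target_total - len(always_included)
    if not 0 <= needed <= len(sampling_pool):
        raise ValueError("target_total is not reachable from the two groups")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return list(always_included) + rng.sample(list(sampling_pool), needed)


# Hypothetical evaluation IDs; only the group sizes come from the reported counts.
always_included = [f"kept-{i}" for i in range(463)]
sampling_pool = [f"pool-{i}" for i in range(1184)]

expected_sample = build_expected_sample(always_included, sampling_pool, target_total=930)

print(len(expected_sample))   # 930
print(round(930 / 311, 2))    # 2.99, i.e., roughly 3:1 expected to high performers
print(311 + 930 + 43)         # 1284 evaluations analyzed in total
```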
Model‐predicted proportions, as well as both unadjusted and adjusted comparisons (adjusted for resident gender, resident race, evaluator gender, and evaluator race) between high‐, expected‐, and below‐expected performer residents, are presented. Statistical analyses were performed using SAS Software (Version 9.4 M7, SAS Institute, NC). Power calculations to determine the detectable effect size between high and expected performers were performed using PASS (v.2019, Kaysville, UT).
RESULTS
A total of 1284 evaluations (311 high performers, 930 expected performers, and 43 below‐expected performers) from 94 unique residents over the 28 months were included in the analysis. The median number of evaluations for a resident was 13 (IQR = 6–20, range 1–34). Sixty‐six unique evaluators provided a median of eight evaluations (IQR = 3–27, range 1–181). Demographic information of both residents and evaluators was recorded (Table 2). Blank evaluations were coded as zero.
TABLE 2.
Demographic characteristics of residents and evaluators
| | Total n (%) | Expected n (%) | Below‐expected n (%) | High performer n (%) |
|---|---|---|---|---|
| Resident Female Gender | 496 (38.6) | 342 (36.8) | 15 (34.9) | 139 (44.7) |
| Resident Race | | | | |
| Native American, American Indian | 17 (1.3) | 12 (1.3) | 0 (0) | 5 (1.6) |
| Asian | 345 (26.9) | 257 (27.6) | 19 (44.2) | 69 (22.2) |
| Black | 146 (11.4) | 100 (10.8) | 2 (4.7) | 44 (14.2) |
| White | 776 (60.4) | 561 (60.3) | 22 (51.2) | 193 (62.1) |
| Evaluator Female Gender | 433 (33.7) | 291 (31.3) | 14 (32.6) | 128 (41.2) |
| Evaluator Race/Ethnicity | | | | |
| Asian | 253 (19.7) | 194 (20.9) | 5 (11.6) | 54 (17.4) |
| Black | 47 (3.7) | 33 (3.6) | 4 (9.3) | 10 (3.2) |
| Hispanic, Latino, Spanish | 186 (14.5) | 178 (19.1) | 6 (14.0) | 2 (0.6) |
| White | 738 (57.5) | 467 (50.2) | 28 (65.1) | 243 (78.1) |
| Other | 60 (4.7) | 58 (6.2) | 0 (0) | 2 (0.6) |
Note: n: number.
The kappa statistic for all categories was >0.8, with individual kappa values as follows: Specific κ = 1.000; Measurable κ = 0.971; Achievable κ = 1.000; and Relevant κ = 1.000. A kappa for Time‐bound could not be calculated, as both coders assigned only 0s to all entries reviewed in this subset.
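A short sketch shows why kappa is undefined when both coders assign the same single code to every entry. It assumes the reported statistic is Cohen's kappa for two raters, which the text does not state explicitly; the function name and toy data are hypothetical.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders rating the same entries.

    Returns None when chance agreement equals 1, because the statistic
    (observed - expected) / (1 - expected) is then undefined.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of entries")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return None
    return (observed - expected) / (1 - expected)


# Toy codes, not study data.
print(cohen_kappa([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0  (perfect agreement)
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5  (one disagreement)
print(cohen_kappa([0, 0, 0, 0], [0, 0, 0, 0]))  # None (only 0s from both coders)
```

In the last case the coders agree on every entry, yet chance agreement is also 100%, so the denominator is zero. This matches the situation described for the Time‐bound criterion.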
Mean total modified SMART scores were less than three out of five in all performance groups. Total SMART scores were significantly higher in high performers and below‐expected performers compared to those designated as expected performers by faculty evaluators. Pairwise comparisons show that Achievable and Relevant feedback was provided to high performers in a significantly larger proportion than to expected and below‐expected performers (Table 3). No statistically significant differences were found for the other criteria.
TABLE 3.
Modified SMART scores for high performers, expected, and below‐expected residents.
| | Unadjusted a (least squares mean or n [%]) | Unadjusted b GEE (model‐derived proportions) | Adjusted c GEE (model‐derived proportions) |
|---|---|---|---|
| Mean SMART score | | p < 0.001 | p < 0.001 |
| High performer | 2.72 | 2.72 d | 2.58 d |
| Expected | 2.21 | 2.21 | 2.19 |
| Below‐expected | 2.70 | 2.68 d | 2.55 d |
| Specific (S) | | p = 0.96 | p = 0.98 |
| High performer | 84 (27.0) | 28.5 | 26.8 |
| Expected | 186 (20.0) | 28.3 | 26.5 |
| Below‐expected | 8 (18.6) | 30.3 | 27.8 |
| Measurable (M) | | p = 0.34 | p = 0.46 |
| High performer | 144 (46.3) | 48.2 | 49.5 |
| Expected | 360 (38.7) | 42.2 | 43.9 |
| Below‐expected | 23 (53.5) | 51.4 | 51.6 |
| Achievable (A) | | p = 0.01 | p = 0.01 |
| High performer | 308 (99.0) | 99.0 d | 99.2 d |
| Expected | 754 (81.1) | 88.3 | 89.3 |
| Below‐expected | 42 (97.7) | 98.6 | 98.5 |
| Relevant (R) | | p = 0.01 | p = 0.01 |
| High performer | 308 (99.0) | 99.0 d | 99.2 d |
| Expected | 753 (81.0) | 88.2 | 89.2 |
| Below‐expected | 42 (97.7) | 98.6 | 98.5 |
| Time‐bound (T) | | p = 0.34 | Insufficient number of events |
| High performer | 2 (0.6) | 0.4 | |
| Expected | 1 (0.1) | 0.1 | |
| Below‐expected | 1 (2.3) | 2.3 | |
Note: n, number; GEE, Generalized Estimating Equations.
a Crude summaries without adjustment for clustering or demographic characteristics. For total SMART score, exponentiated least squares means from a negative binomial regression model are presented.
b From GEE model accounting for clustering by evaluator.
c From GEE model accounting for clustering by evaluator and adjustment for demographic characteristics (resident gender, resident race, evaluator gender, evaluator race).
d Denotes that the value was significantly higher than that for expected performers.
DISCUSSION
In this study, we found a statistically significant written feedback gap for residents deemed to be performing at expected clinical levels by supervising faculty. This clinical performance group also accounted for the majority of analyzed evaluations (72.4%) during the study period.
A faculty member may truly feel neutral‐to‐positive regarding residents meeting expectations and, therefore, feel less internally motivated or less able to provide targeted feedback. The study group suspects that there could be an event or instance that brings faculty to a decision that residents have deviated from expected performance, allowing more detailed discussion surrounding the answer to the question “Why did this resident perform below/above your expectations?” Greater consistency for evaluators could also suggest that certain faculty regularly write higher scoring comments than others.
Only 278 out of 1284 evaluations met criteria for Specific content of feedback. These findings should serve as a challenge to academic faculty to actively provide feedback to residents at all performance levels and have served as a needs assessment for the study site regarding faculty feedback professional development. A 2018 faculty retreat focused on feedback “best practices,” and our results suggest there is still room for growth. The authors wonder whether “protective hesitancy” plays a role in different groups receiving distinct types of feedback: are individuals who provide feedback hesitant to give formative feedback in order to avoid hurting the feelings or self‐esteem of those receiving it? 20–22 The “real world” significance of receiving only certain SMART items is also unclear (e.g., what is the significance of receiving Achievable, but not Specific, feedback?). Further work is needed to help answer these questions.
This is the first study to date that has examined differences in written feedback for residents at various levels across the spectrum of clinical performance. Future research should define performance with qualitative descriptors, perhaps using thematic analysis from a similar feedback collection mechanism to the one studied here. Though a daunting task at face value, the consensus milestones and graduated descriptive language were similarly developed. Further work may also delve into any possible demographic differences and thematic analyses.
LIMITATIONS
Our study had several limitations. The most significant limitation is that high, expected, and below‐expected performance levels were not defined for the attending evaluators. These ratings were not defined because there is no standard definition, and individual attendings could subjectively choose these levels based on their personal experience as an attending and educator. This study was conducted at a single academic institution with a large EM residency program and may not be generalizable to all institutions. Faculty at the study site are inconsistent in providing written feedback after every shift: the number of times faculty fill out the evaluation form is tracked as a performance measure by departmental leadership, and participation can vary from 0 to 15 entries per month per faculty member, many of whom have varying degrees of interest in education and varying clinical shift loads. Nonetheless, the feedback culture is strong and feedback is frequently reviewed at the study institution, and results could differ at other institutions without this culture. The written feedback may not fully reflect or align with verbal feedback given. Coding relied on subjective interpretation of the text. We attempted to address this by developing consensus and establishing strong interrater reliability with a kappa >0.8. The number of below‐expected performers in our data set was low, limiting the findings for this cohort. While a statistical difference was found, it is unclear whether this difference is practically meaningful for learners.
CONCLUSIONS
Mean total modified SMART feedback scores were significantly greater in high performers and below‐expected performers when compared to expected performers. Achievable and Relevant feedback was provided in greater proportions to high performer residents compared to expected and below‐expected performers. These findings are a challenge to academic faculty to engage in quality feedback delivery for residents at all performance levels.
AUTHOR CONTRIBUTIONS
RC: concept origin, drafting of manuscript, data coding, corresponding author. AT: concept origin, data coding, figure creation. MG: critical revision of manuscript, methodology advisor. JB: critical revision of manuscript, lit search. RB: critical revision of manuscript. JD: statistician, figure/table creation. DD: concept origin, critical revision of manuscript. KG: concept origin, critical revision of manuscript, lit search.
CONFLICT OF INTEREST
No conflicts of interest to report.
Coughlin RF, Tsyrulnik A, Gottlieb M, et al. Differences in faculty feedback for high, expected, and below‐expected clinically performing emergency medicine residents. AEM Educ Train. 2022;6:e10788. doi: 10.1002/aet2.10788
Supervising Editor: Dr. Esther Chen.
Presentations: Early iteration of this work was presented at Virtual ACEP October 2020.
REFERENCES
- 1. McGhee J, Crowe C, Kraut A, et al. Do emergency medicine residents prefer resident‐initiated or attending‐initiated feedback? AEM Educ Train. 2017;1:15‐20.
- 2. Mueller AS, Jenkins TM, Osborne M, Dayal A, O'Connor DM, Arora VM. Gender differences in attending physicians' feedback to residents: a qualitative analysis. J Grad Med Educ. 2017;9:577‐585.
- 3. Yarris LM, Linden JA, Gene Hern H, et al. Attending and resident satisfaction with feedback in the emergency department. Acad Emerg Med. 2009;16:S76‐S81.
- 4. Egan DJ, Gentges J, Regan L, Smith JL, Williamson K, Murano T. An emergency medicine remediation consult service: access to expert remediation advice and resources. AEM Educ Train. 2019;3:193‐196.
- 5. Murano T, Smith JL, Weizberg M. Remediation strategies for emergency medicine patient care milestones. Cureus. 2018;10:e3557.
- 6. Nadir NA, Hart D, Cassara M, et al. Simulation‐based remediation in emergency medicine residency training: a consensus study. West J Emerg Med. 2019;20:145‐156.
- 7. Nemani VM, Park C, Nawabi DH. What makes a "great resident": the resident perspective. Curr Rev Musculoskelet Med. 2014;7:164‐167.
- 8. Pines JM, Alfaraj S, Batra S, et al. Factors important to top clinical performance in emergency medicine residency: results of an ideation survey and Delphi panel. AEM Educ Train. 2018;2:269‐276.
- 9. Reed DA, West CP, Mueller PS, Ficalora RD, Engstler GJ, Beckman TJ. Behaviors of highly professional resident physicians. JAMA. 2008;300:1326‐1333.
- 10. ACGME Common Program Requirements (Residency). 2021 Accreditation Council for Graduate Medical Education (ACGME); 2020.
- 11. Buckley C, Natesan S, Breslin A, Gottlieb M. Finessing feedback: recommendations for effective feedback in the emergency department. Ann Emerg Med. 2020;75:445‐451.
- 12. Della‐Giustina D, Kamran A, Wood DB, Goldflam K. Resident self‐assessment and the deficiency of individualized learning plans in our residencies. West J Emerg Med. 2020;22:33‐36.
- 13. Ericsson KA. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad Med. 2004;79:S70‐S81.
- 14. Ericsson KA, Pool R. Peak: how all of us can achieve extraordinary things. Vintage; 2017.
- 15. Natesan SM, Krzyzaniak SM, Stehman C, Shaw R, Story D, Gottlieb M. Curated collections for educators: eight key papers about feedback in medical education. Cureus. 2019;11:e4164.
- 16. Sherbino J, Norman G. On rating angels: the halo effect and straight line scoring. J Grad Med Educ. 2017;9:721‐723.
- 17. Coughlin RF, Della‐Giustina D, Tsyrulnik A, et al. 278 Identifying high performer residents in emergency medicine training. Ann Emerg Med. 2020;76:S106‐S107.
- 18. Harrison R, Tsyrulnik A, Wood DB, Coughlin RF, Della‐Giustina D, Goldflam K. An innovative feedback tool leading to improved faculty feedback and positive reception by residents. West J Emerg Med. 2019;21:47‐51.
- 19. Ende J. Feedback in clinical medical education. JAMA. 1983;250:777‐781.
- 20. Ramani S, Könings KD, Ginsburg S, van der Vleuten CPM. Twelve tips to promote a feedback culture with a growth mind‐set: swinging the feedback pendulum from recipes to relationships. Med Teach. 2019;41:625‐631.
- 21. Washington ZRL. Women of color get less support at work. Here's how managers can change that. Harv Bus Rev. 2019;4:2019.
- 22. Correll SSC. Research: vague feedback is holding women back. Harv Bus Rev. 2016;29:2016.
