STRUCTURED ABSTRACT:
Objective:
To determine differences in Entrustable Professional Activity (EPA) assessments between male and female general surgery residents.
Summary Background Data:
Evaluations play a critical role in career advancement for physicians. However, female physicians in training receive lower evaluations and underrate their own performance. Competency-based assessment frameworks, such as EPAs, may help address gender bias in surgery by linking evaluations to specific, observable behaviors.
Methods:
In this cohort study, EPA assessments were collected from July 2018 to May 2020. The effect of resident gender on EPA entrustment levels was analyzed using multiple linear and ordered logistic regressions. Narrative comments were analyzed using Latent Dirichlet Allocation (LDA) to identify topics correlated with resident gender.
Results:
Of the 2,480 EPAs, 1,230 EPAs were submitted by faculty and 1,250 were submitted by residents. After controlling for confounding factors, faculty evaluations of residents were not impacted by resident gender (estimate = 0.09, p = 0.08). However, female residents rated themselves lower by 0.29 (on a 0–4 scale) compared to their male counterparts (p < 0.001). Within narrative assessments, topics associated with resident gender demonstrated that female residents focus on the ‘guidance’ and ‘supervision’ they received while performing an EPA, while male residents were more likely to report ‘independent’ action.
Conclusions:
Faculty assessments showed no difference in EPA levels between male and female residents. Female residents rate themselves lower by nearly an entire PGY level compared to male residents. LDA-identified topics suggest this difference in self-assessment is related to differences in perception of autonomy.
MINI-ABSTRACT:
This study looked at differences in Entrustable Professional Activities (EPA) assessments between male and female general surgery residents by analyzing both the numeric EPA scores and the narrative feedback from residents and faculty.
INTRODUCTION:
Despite a recent increase in the number of women entering the medical field and pursuing surgical specialties, there is still evident inequality in terms of leadership roles, rank attainment, and salaries for female physicians.1,2,3 Evaluations can play a significant role in both rank attainment and career advancement, and implicit gender bias in evaluations can act as a barrier for female physicians.
Both qualitative and quantitative studies addressing resident evaluations demonstrate an implicit gender bias in medicine. Studies have reported differences in the words used to describe female and male physicians in evaluations and recommendation letters.4–5 Males were described with more positive and standout words, such as “leader” or “excellent,” compared to their female counterparts.4 Not only are there disparities in the words used to describe male and female physicians, but female physicians receive more inconsistent qualitative feedback.5 Studies have also shown that female physicians are more likely to receive lower performance metrics in their evaluations.5–6 Additionally, women throughout various stages in their medical training tend to rate themselves lower than males, despite demonstrating comparable performances.7–8 The gender bias seen in physician evaluations, including the trend in lower female self-evaluations, may be hindering advancement and professional growth for female surgeons.
Investigation of gender disparities in performance evaluations is essential. Entrustable Professional Activities (EPAs) are a novel form of competency-based medical education that evaluate specific, observable clinical activities (such as a surgical consultation for other health care professionals) and provide an entrustability roadmap with defined behavioral anchors at each level of entrustment.9 EPAs have specific, measurable actions and competencies, potentially alleviating faculty subjectivity and bias.10 The EPAs are administered in the form of microassessments, or brief, focused evaluations designed for immediate feedback and direction. As EPAs are poised to become more widely used in general surgery, it is critical to analyze gender disparities and bias in EPA microassessments. Although studies in the past have examined gender differences in resident evaluations, to the authors’ knowledge, no study has analyzed the gender differences in both quantitative and narrative feedback of general surgery EPAs. We hypothesized that female residents would rate their performance lower than faculty ratings and receive lower faculty evaluations compared to male residents. Additionally, we hypothesized that there would be differences in the narrative feedback received by male and female residents.
METHODS:
Data Collection:
The EPA implementation strategy used at our institution involves a custom mobile application to collect EPA entrustment levels for a unique clinical encounter and free-text feedback from both faculty and residents. Prior to initiation, both faculty and residents underwent development interventions on EPA utilization. Immediately following performance of one of the five general surgery EPAs (Gallbladder Disease, Right Lower Quadrant Pain, Inguinal Hernia, General Surgery Consultation, Trauma Evaluation), faculty and residents complete an assessment on the mobile application. The EPAs characterize autonomy on a scale of 0–4, ranging from “Observation only” (0) to “Can teach others” (4). Neither the resident nor the faculty member can see the results of the other’s assessment until completing their own. All EPA assessment data is prospectively collected by the mobile application. The American Board of Surgery EPA Pilot has a central data repository, but it is limited in scope. No granular data is submitted centrally, and different programs collected data using different tools. The central repository contains only summative entrustment decisions determined every six months by the clinical competency committee, while this analysis utilizes each individual EPA submitted by a resident or faculty member, along with required free-text comments. This data is only available at the single-institution level.
Entrustment Level Analysis:
All EPA assessments submitted for non-general surgery residents were excluded. All EPA assessments submitted by faculty outside the departments of general surgery or emergency medicine were excluded. The EPA assessments were divided into groups by assessor type (faculty vs. resident self-assessment), then compared by resident gender. Phase was defined as intraoperative (intraoperative assessment of cholecystectomy, inguinal hernia repair, or appendectomy) or non-operative (all trauma evaluations, all consultations, and pre- or post-operative management of patients with gallbladder disease, right lower quadrant pain, or inguinal hernias). Entrustment levels are defined as follows: Observation only (Level 0), Direct supervision (Level 1), Indirect supervision (Level 2), Can supervise others (Level 3), and Can teach others (Level 4).
EPA entrustment levels are ordinal data, resembling a Likert scale with categories from 0–4. While ordinal data may violate the linearity and normality assumptions of linear regression, numerous empirical studies have shown that t-tests and linear regression are robust to these violations, particularly in large sample sizes.11,12 Bivariate comparison between male and female residents was performed using the chi-square test for categorical variables and Student’s t-test for continuous variables and EPA levels. A p-value of <0.05 was deemed sufficient to reject the null hypothesis.
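As an illustration of the bivariate comparison described above, the following minimal Python sketch applies a Pearson chi-square test to the faculty-gender counts reported in Table 1. It is not the authors' code (the analysis was performed in R); the function name `chi_square_2x2` is hypothetical, and the sketch assumes no continuity correction, which the source does not specify.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability of a chi-square
    # statistic equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts from Table 1 (faculty-submitted assessments).
# Rows: female faculty, male faculty; columns: female residents, male residents.
faculty_by_resident = [[185, 144], [533, 368]]
stat, p_value = chi_square_2x2(faculty_by_resident)
print(f"chi-square = {stat:.2f}, p = {p_value:.2f}")
```

Under these assumptions the computed p-value is approximately 0.36, in line with the value reported for faculty gender in Table 1.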
Multiple linear regression models were used to analyze the effect of resident gender on EPA entrustment levels, while controlling for faculty gender, PGY of training, EPA type, faculty department, faculty years of experience, and phase of care. Two models were created: one for faculty-submitted assessments and one for resident-submitted self-assessments. Following model fit, residual vs. fitted plots confirmed a linear relationship between predictors and outcomes, quantile-quantile plots confirmed normally distributed residuals, scale-location plots confirmed homogeneity of variance, and residual vs. leverage plots did not identify any problematically influential outliers. Despite these reassuring linear model diagnostics, ordered logistic regressions were also performed for each dataset (faculty-only assessments and resident self-assessments) to assess whether modeling EPA levels as an ordinal variable would greatly impact results. The proportional odds assumption was analyzed using Brant’s test.13 Brant’s test for the resident gender variable was nonsignificant for both models, suggesting that the proportional odds assumption held for the primary variable of interest.
However, the overall Brant’s test did not support proportional odds for the full resident assessment model. Therefore, multinomial logit models were fit to both datasets to see whether they meaningfully changed the results relative to the ordered logistic regression. Among the resident self-assessment models, the multinomial model had the lowest Akaike information criterion (AIC), implying better model fit (linear regression: AIC 2280; ordered logistic regression: AIC 2241; multinomial logit regression: AIC 2223). This was not the case for the faculty assessment models. In accordance with the Brant’s test results that supported the proportional odds assumption for the faculty assessment model, the ordered logistic regression performed the best (linear regression: AIC 1963; ordered logistic regression: AIC 1880; multinomial logit regression: AIC 1904). Most importantly, across all three model specifications, the coefficients related to the impact of resident gender on EPA assessment levels remained similar in terms of direction and magnitude. Therefore, for ease of interpretability, the linear regression model makes up the primary quantitative results reported in this analysis, with an additional figure showing the results of the proportional odds regression to highlight the similarities in model findings.
Narrative Analysis:
Latent Dirichlet Allocation (LDA) topic modeling was used to identify latent themes or “topics” within the narrative comments associated with assessments. LDA topic modeling is a commonly used unsupervised machine-learning algorithm that identifies latent topics by analyzing probability distributions of word frequencies and the structure of words within documents.14 LDA is a ‘bag-of-words’ model, so the topics generated by this algorithm consist of word lists that are statistically related. The LDA algorithm uses a Dirichlet distribution to allocate words present in documents into topics.15,16 A simplified view of the allocation process is as follows: the algorithm randomly assigns every word in the document to one of k topics (the number of topics, k, is selected by the researcher). In this study, k=2. The algorithm then “throws out” the topic for a single word and re-calculates the predicted topic for that word based on the distribution of topics among all other words. This word is reassigned to that predicted topic, and the process is repeated for every word in the document. This process is then repeated thousands of times until the estimated topic distributions reach a steady state (the words stop getting assigned to new topics).17
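The allocation process described above can be sketched as a collapsed Gibbs sampler. The following Python code is a minimal illustration only, not the authors' implementation (the analysis was performed in R). The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions introduced for this example. Note that in Gibbs sampling the new topic is drawn at random in proportion to the recalculated weights rather than chosen deterministically.

```python
import random
from collections import Counter


def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (gamma, topic_word): gamma[d][t] is the estimated share of
    document d attributed to topic t; topic_word[t] counts words in topic t.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of k topics.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Step 2: "throw out" the current topic of this word.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 3: weight each topic using all other assignments.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]

                # Step 4: draw a new topic and add the word back to the counts.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed per-document topic proportions (each row sums to 1).
    gamma = [
        [(n + alpha) / (len(doc) + k * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return gamma, topic_word


# Hypothetical toy documents, not study data.
toy_docs = [
    "guidance supervision steps guidance dissection".split(),
    "independently plan consult independently patient".split(),
]
gamma, topic_word = lda_gibbs(toy_docs)
for row in gamma:
    print([round(g, 2) for g in row])
```

Each printed row gives one document's estimated topic proportions, which play the role of the per-document topic probabilities used to match topics to resident gender. With only two very short documents the result depends on the random seed, so the sketch shows the mechanics rather than a stable topic solution.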
Two metrics that guide interpretation of LDA topic modeling are term score and gamma (γ). Term score is a tf-idf (short for “term frequency- inverse document frequency”) rank which reflects the importance of terms within a topic. Term scores are used to rank words within an LDA topic: words with higher term scores are at the top of the list and have the strongest association with that topic. Gamma (γ) in LDA topic modeling is a measure of the probability of a document belonging to a particular topic.18 For example, if the document “female resident comments” has a gamma of 1.0 for “LDA Topic 1,” and a gamma of 0 for the remaining topics, then the document “female resident comments” is 100% composed of the LDA Topic 1. In this analysis, gamma scores were used to determine the quality of the match between the two identified topics and resident gender (Supplemental Figure 1).
For topic modeling, EPA assessment comments were divided into separate datasets by assessor type (faculty evaluation vs. resident self-evaluation) and PGY level. As previously described, within each dataset (e.g. faculty evaluations submitted for PGY-1 level residents), LDA was performed to identify two latent topics in the corpus.17 If the topics assigned by LDA correlated well with resident gender differences (determined by having a gamma score of ~1.0 for a particular LDA topic), then the word lists representing the latent topics for each gender were subsequently interpreted by the researcher. The entire process—document selection, LDA modeling, topic::document comparison—is described in detail in Supplemental Figure 1. Bolded terms were used in the table to highlight terms associated with resident autonomy that differed between male and female residents at that PGY level. Conversely, gamma scores ~0.5 suggested that the two topics were evenly split among gender groups and did not offer meaningful insights into gender differences. These topics were not further analyzed, and are combined in a single cell in the table to highlight the inability of those topics to inform gender differences.
This project was reviewed by the institutional Health Sciences IRB and certified as exempt from formal review. All statistical analysis and data visualizations were performed using R 4.0.0 (R Foundation for Statistical Computing, Vienna, Austria).
RESULTS:
Entrustment Levels – Demographics and Univariate Analysis:
Over 23 months, 2,480 EPAs populated the final dataset. There were 1,250 resident self-assessments submitted by 55 residents (23 male, 32 female) and 1,230 faculty assessments submitted by 80 faculty (50 male, 30 female). Differences in the characteristics of faculty assessments submitted for male and female residents are described in Table 1. There were significant baseline differences in EPA type, PGY level, and phase of care. There was no statistically significant difference in EPA entrustment levels between the male and female residents when they were evaluated by faculty (2.7 vs. 2.8, p = 0.58) (Table 1).
Table 1:
Characteristics of EPA Assessments Submitted by Faculty
| Characteristic | Female Resident (N=718) | Male Resident (N=512) | P value* |
|---|---|---|---|
| Faculty Gender | | | 0.36 |
| Female | 185 (25.8%) | 144 (28.1%) | |
| Male | 533 (74.2%) | 368 (71.9%) | |
| EPA Type | | | < 0.01 |
| Gallbladder Disease | 192 (26.7%) | 91 (17.8%) | |
| GS Consultation | 110 (15.3%) | 102 (19.9%) | |
| Inguinal Hernia | 98 (13.6%) | 74 (14.5%) | |
| RLQ Pain | 164 (22.8%) | 90 (17.6%) | |
| Trauma | 154 (21.4%) | 155 (30.3%) | |
| Phase of Care | | | < 0.01 |
| Intraoperative | 335 (46.7%) | 176 (34.4%) | |
| Non-operative | 383 (53.3%) | 336 (65.6%) | |
| PGY Level | | | < 0.01 |
| 1 | 47 (6.5%) | 29 (5.7%) | |
| 2 | 123 (17.1%) | 79 (15.4%) | |
| 3 | 109 (15.2%) | 86 (16.8%) | |
| 4 | 202 (28.1%) | 269 (52.5%) | |
| 5 | 237 (33.0%) | 49 (9.6%) | |
| Faculty Department | | | 0.43 |
| Emergency Medicine | 204 (28.4%) | 156 (30.5%) | |
| General Surgery | 514 (71.6%) | 356 (69.5%) | |
| Faculty Years Experience | 10.5 (9.3) | 9.4 (9.1) | 0.10 |
| EPA Entrustment Level** | 2.7 (1.0) | 2.8 (0.9) | 0.58 |
* P values were determined using chi-square test for categorical variables and Student’s t-test for continuous variables.
** Level 0 – Observation Only, Level 1 – Direct Supervision, Level 2 – Indirect Supervision, Level 3 – Can Supervise Others, Level 4 – Can Teach Others
Differences in the characteristics of resident self-assessments are described in Table 2. There were similar significant baseline differences in EPA type, PGY level, phase of care, and faculty department. However, the EPA entrustment levels in the resident self-assessments were significantly lower for female residents than for male residents (2.0 vs. 2.5, p < 0.01) (Table 2). When directly comparing resident self-assessments and faculty evaluations, both female and male residents underrated their performance relative to faculty entrustment levels, but female residents underrated their performance to a greater degree. The mean EPA entrustment levels assigned to male residents were 2.77 for faculty evaluations and 2.52 for resident self-evaluations (p < 0.001). The mean EPA entrustment levels for female residents were 2.73 for faculty evaluations and 2.03 for resident self-evaluations (p < 0.001). This same trend was found in a sensitivity analysis of a subset of matched-pair assessments where 1,680 matched EPA assessments from the same clinical encounters were analyzed.
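The size of the self-assessment gap for each gender follows directly from the unadjusted means reported above. The short Python sketch below is a worked calculation only; the variable names are illustrative and not taken from the authors' analysis.

```python
# Unadjusted mean entrustment levels reported in the text (0-4 scale).
means = {
    "male": {"faculty": 2.77, "self": 2.52},
    "female": {"faculty": 2.73, "self": 2.03},
}

# Gap = faculty rating minus self-rating; a positive gap means the
# residents rated themselves below the faculty rating.
gaps = {
    gender: round(levels["faculty"] - levels["self"], 2)
    for gender, levels in means.items()
}
print(gaps)
```

The gap is 0.25 entrustment levels for male residents and 0.70 for female residents.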
Table 2:
Characteristics of EPA Resident Self-Assessments
| Characteristic | Female Resident (N=577) | Male Resident (N=673) | P value* |
|---|---|---|---|
| Faculty Gender | | | 0.56 |
| Female | 200 (34.7%) | 244 (36.3%) | |
| Male | 377 (65.3%) | 429 (63.7%) | |
| EPA Type | | | < 0.01 |
| Gallbladder Disease | 174 (30.2%) | 146 (21.7%) | |
| GS Consultation | 54 (9.4%) | 124 (18.4%) | |
| Inguinal Hernia | 87 (15.1%) | 93 (13.8%) | |
| RLQ Pain | 194 (33.6%) | 155 (23.0%) | |
| Trauma | 68 (11.8%) | 155 (23.0%) | |
| Phase of Care | | | < 0.01 |
| Intraoperative | 362 (62.7%) | 285 (42.3%) | |
| Non-operative | 215 (37.3%) | 388 (57.7%) | |
| PGY Level | | | < 0.01 |
| 1 | 78 (13.5%) | 23 (3.4%) | |
| 2 | 143 (24.8%) | 143 (21.2%) | |
| 3 | 145 (25.1%) | 144 (21.4%) | |
| 4 | 103 (17.9%) | 300 (44.6%) | |
| 5 | 108 (18.7%) | 63 (9.4%) | |
| Faculty Department | | | 0.01 |
| Emergency Medicine | 70 (12.1%) | 116 (17.2%) | |
| General Surgery | 507 (87.9%) | 557 (82.8%) | |
| Faculty Years Experience | 9.6 (8.2) | 9.8 (8.2) | 0.61 |
| EPA Entrustment Level** | 2.0 (0.9) | 2.5 (0.9) | < 0.01 |
* P values were determined using chi-square test for categorical variables and Student’s t-test for continuous variables.
** Level 0 – Observation Only, Level 1 – Direct Supervision, Level 2 – Indirect Supervision, Level 3 – Can Supervise Others, Level 4 – Can Teach Others
Entrustment Levels – Multivariable Analysis
After controlling for faculty gender, PGY level, EPA type, faculty department, faculty years of experience, and phase of EPA, multiple linear regression analysis revealed no significant difference in male and female resident EPA entrustment levels submitted by faculty (estimate = 0.09, p = 0.08) (Table 3). However, when compared to the faculty assessments (gold standard), analysis of resident self-assessments showed that female residents significantly underrated their own performance relative to their male resident counterparts (estimate = −0.29, p < 0.001) (Table 4). These findings were confirmed by ordered logistic regression (Figure 1). For context, each increasing PGY level tended to increase EPA entrustment levels by ~0.4, regardless of resident or faculty assessor (Tables 3–4); thus a decreased self-assessment of −0.29 reflects an incremental decrease that equates to almost a full PGY level. The interaction between faculty and resident gender (e.g. female faculty/female resident, female faculty/male resident, male faculty/female resident, and male faculty/male resident) was also examined in the initial model, but no interaction was found, so this term was removed from the final analysis. Again, a sensitivity analysis performed only on those data for which both resident and faculty submitted an assessment confirmed these results (Supplemental Tables 1–2).
Table 3:
Effect of Resident Gender on Faculty EPA Assessments After Controlling for Confounding Variables
| Variable | Category | Estimate* | P Value** |
|---|---|---|---|
| Resident Gender | Female | Ref | -- |
| | Male | 0.09 | 0.08 |
| Faculty Gender | Female | Ref | -- |
| | Male | −0.04 | 0.46 |
| PGY | PGY 1 | Ref | -- |
| | PGY 2 | 0.30 | < 0.01 |
| | PGY 3 | 0.97 | < 0.001 |
| | PGY 4 | 1.34 | < 0.001 |
| | PGY 5 | 1.99 | < 0.001 |
| EPA | Gallbladder Disease | Ref | -- |
| | GS Consult | 0.05 | 0.66 |
| | Inguinal Hernia | −0.21 | 0.01 |
| | RLQ Pain | 0.40 | < 0.001 |
| | Trauma | 0.01 | 0.93 |
| Faculty Department | Emergency Medicine | Ref | -- |
| | General Surgery | 0.39 | 0.001 |
| Faculty Years Experience | | −0.01 | < 0.001 |
| Phase of Care | Intraoperative | Ref | -- |
| | Non-operative | 0.69 | < 0.001 |
* Estimate refers to the increase or decrease in mean EPA level from the reference. Different categories in each variable are compared to the reference (e.g. every PGY level is compared to the reference PGY1).
** P values were determined using multiple linear regression.
Adjusted R-squared: 0.52
n = 1,230
Table 4:
Effect of Resident Gender on Resident EPA Self-Assessments After Controlling for Confounding Variables
| Variable | Category | Estimate* | P Value** |
|---|---|---|---|
| Resident Gender | Female Resident | Ref | -- |
| | Male Resident | 0.29 | < 0.001 |
| Faculty Gender | Female Faculty | Ref | -- |
| | Male Faculty | −0.01 | 0.89 |
| PGY | PGY 1 | Ref | -- |
| | PGY 2 | 0.44 | < 0.001 |
| | PGY 3 | 0.89 | < 0.001 |
| | PGY 4 | 1.33 | < 0.001 |
| | PGY 5 | 1.95 | < 0.001 |
| EPA | Gallbladder Disease | Ref | -- |
| | GS Consult | 0.01 | 0.93 |
| | Inguinal Hernia | −0.34 | < 0.001 |
| | RLQ Pain | 0.32 | < 0.001 |
| | Trauma | −0.24 | 0.01 |
| Faculty Department | Emergency Medicine | Ref | -- |
| | General Surgery | −0.25 | 0.16 |
| Faculty Years Experience | | 0.01 | 0.08 |
| Phase of Care | Intraoperative | Ref | -- |
| | Non-operative | 0.56 | < 0.001 |
* Estimate refers to the increase or decrease in mean EPA level (0–4) from the reference. Different categories in each variable are compared to the reference (e.g. every PGY level is compared to the reference PGY1).
** P values were determined using multiple linear regression.
Adjusted R-squared: 0.48
n = 1,250
Figure 1: Impact of Resident Gender on Resident Self-Assessments vs Faculty Assessments: Ordered Logistic Regression*.

Similar to the results shown in the linear regression, male residents rate themselves higher than female residents. However, faculty rate male and female residents equally. Higher PGY levels are associated with higher EPA scores regardless of assessor, and non-operative EPAs are associated with higher EPA scores than operative EPAs.
* Results are reported as log-odds ratios rather than traditional exponentiated ratios for confidence interval symmetry (e.g. e^0 = 1, e^2.5 = 12.182). Results for categorical variables with > 2 categories are shown relative to the reference category (e.g. PGY2 compared to PGY1, PGY3 compared to PGY1, etc., and each EPA type is compared to the referent Gallbladder Disease).
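The relationship between the log-odds scale used in Figure 1 and conventional odds ratios can be illustrated with a short Python sketch. The helper name `odds_ratio` is hypothetical, and the input values are chosen only to mirror the example in the footnote above.

```python
import math


def odds_ratio(log_odds):
    """Convert a log-odds ratio to a conventional (exponentiated) odds ratio."""
    return math.exp(log_odds)


# Equal distances from 0 on the log-odds scale map to unequal distances
# from 1 on the odds-ratio scale.
for value in (-2.5, 0.0, 2.5):
    print(value, round(odds_ratio(value), 3))
```

Log-odds of −2.5 and 2.5 are equidistant from 0, whereas the corresponding odds ratios (about 0.082 and 12.182) are not equidistant from 1, which is why intervals look symmetric only on the log-odds scale.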
Table 5:
LDA Topic Modeling of Resident Self-Assessment EPA Comments
| Resident Comments | PGY1 (n = 1,391 words) | PGY2 (n = 4,159 words) | PGY3 (n = 3,499 words) | PGY4 (n = 3,051 words) | PGY5 (n = 1,349 words) |
|---|---|---|---|---|---|
| Male Resident | (γ = 1.0) Patient Plan Comfortable Consult Independently Attending Operation Movements Instruments Consultation | | | (γ = 1.0) Trauma Patient Plan Independently Consult Evaluated Difficult Team Level Management | (γ = 1.0) Junior Dissection Anatomy Resident level Gallbladder Operation Operative Plane Difficult |
| Female Resident | (γ = 1.0) Patient Guidance Steps History Anatomy Laparoscopic Identify Dissection Feel Direct Basic | | | (γ = 1.0) Plan Staff Patient Perform Trauma Team Oversight Minimal Supervision Treatment | (γ = 1.0) Dissection Challenging Patient Identify Junior Planes Comfortable Resident Tissue Minimal |
| Male and Female Residents (topics not correlated with gender) | | Male Resident: γ = 0.62 for topic 1, 0.38 for topic 2. Female Resident: γ = 0.48 for topic 1, 0.52 for topic 2. Topic 1: Appendix Consult Patient Team Plan Trauma Perform Dissection Tissue Communicated. Topic 2: Patient Plan Dissection Time Required Primary Level Management Guidance Appendix | Male Resident: γ = 0.47 for topic 1, 0.53 for topic 2. Female Resident: γ = 0.62 for topic 1, 0.38 for topic 2. Topic 1: Patient Gallbladder Plan Dissection Steps Independently Operation Placement Safely Critical. Topic 2: Dissection Patient Assistance Independently Steps Perform Technique Critical Supervision | | |
LDA topic modeling was used to analyze the gender differences in narrative comments throughout PGY levels for faculty assessments and resident self-assessment.
This table was created using the three step process described in Supplemental Figure 1. Each dataset is represented as a single column representing a given assessor and PGY level (e.g. PGY1 resident self-evaluations). In step 1, LDA was performed on each dataset. The gamma score was used to see if the LDA algorithm “worked”, that is, uncovered latent topics that correlated with resident gender. A gamma score of ~0.5 meant the algorithm “didn’t work” and failed to identify topics that correlated with resident gender. These topics were not further analyzed, and are represented by cells with two topics in them, written in italics.
On the other hand, gamma scores of ~1 meant the algorithm “worked” at identifying latent topics correlated with each gender. These topics are reported in individual cells under the corresponding gender. These topics were then further analyzed for differences in words related to autonomy. If differences were found, the words representing them are highlighted in bold. If not, all words are left in normal type (e.g. resident comments for PGY5 above).
Narrative Analysis – Faculty Assessments:
The LDA algorithm was able to identify topics that corresponded well to resident gender differences for PGY1–4 resident faculty assessments (γ~1.0). Topics identified for faculty assessments of PGY5 residents did not correlate with gender differences (γ~0.5). When the word lists for faculty assessments were analyzed, the topics seemed to be either highly similar or to relate primarily to performance of different types of EPAs (Supplemental Table 3).
Narrative Analysis – Resident Self-Assessments:
For resident self-assessments, resident gender corresponded to the LDA topics generated for PGY1, PGY4, and PGY5 only. The topics identified for PGY1 and PGY4 residents demonstrated a tendency for female residents to focus on the assistance that they needed to complete the EPA, while male residents were more likely to discuss their independent action. Two distinct LDA topics also corresponded well for male and female PGY5 resident self-assessments; however, the words comprising each topic showed no apparent differences in descriptions of autonomy.
For example, the topic associated with male PGY1 residents included words like “independently” (e.g. “I felt comfortable incising and draining the patient’s abscess independently at bedside”) while the topic associated with female PGY1 residents included words such as “guidance” (e.g. “Able to perform basic laparoscopic maneuvers but only with direct clear guidance/instruction”) (Table 5). Similarly, PGY4 males use words such as “independently” and “management” (“Able to assess and counsel independently.”). In contrast, female PGY4 residents use words such as “staff,” “oversight,” and “supervision” (“Able to perform trauma Chief role with minimal supervision”) (Table 5).
To further understand the PGY-year specific differences in narrative comments submitted by male and female residents, gender differences in resident self-evaluation vs faculty entrustment levels were compared for each PGY level (Figure 2). The pattern generally corresponded with the differences seen in narrative self-assessments: PGY1, PGY2, and PGY4 resident self-assessments all demonstrated significantly higher mean EPA entrustment levels for males than female residents (all p < 0.01), while PGY3 and PGY5 residents did not demonstrate gender differences in self-assessed entrustment levels. However, faculty assessments of male and female residents were similar across all PGY levels with the exception of PGY2 (Figure 2).
Figure 2: Mean EPA Levels by PGY Level and Resident Gender.

These unadjusted mean EPA levels show that male residents tend to self-assess higher than female residents, while faculty assess male and female residents similarly. The differences in self-assessment are particularly striking at major residency transition years (PGY1- transition to residency, PGY4- transition to chief role).
DISCUSSION:
Faculty evaluations are the “gold standard” for surgery resident assessment. Fortunately, this study showed no differences in faculty EPA assessment entrustment levels given to male and female surgery residents. However, female residents rate their own performance significantly lower than their male counterparts relative to this “gold standard”. Further analysis of this difference revealed PGY-specific differences in both the EPA entrustment levels residents self-assigned and in the narrative comments on their own performance. In major transition years such as PGY1 (transition to residency) and PGY4 (transition to chief role), female residents are quantitatively underrating their performance by an entire PGY level and are more likely to write reflective comments that focus on the assistance they did (or did not) need to perform the EPA task, rather than their independent actions.
One explanation for the significantly lower female self-assessment and the gender differences in the EPA narrative feedback is that females often experience “imposter syndrome” to a greater degree than their male peers.19 The “imposter syndrome” describes a condition in which an individual feels like a fraud and subsequently doubts their abilities.20 This can be especially prevalent in traditionally male-dominated fields like general surgery. Imposter syndrome contributes to the confidence gap seen between genders, which places women at an even greater disadvantage in a profession like surgery, where confidence is rewarded with more autonomy and leadership roles.2 Another explanation is that when women display confidence and assertiveness, they are challenging gender-schemas and are met with opposition. Women displaying these typically male-associated traits are perceived as “bossy” or “overbearing” because of the perceived threat to gender norms.2,21,22 In attempting to maintain the difficult balance of exuding confidence without being perceived negatively, many female physicians default to under-stating and underrating their performance and capabilities.23
This study demonstrates that EPAs, which are characterized by measurable actions and competencies, may be a powerful tool in limiting gender bias in faculty evaluators. However, these assessments are not sufficient to eliminate the gender differences in residents’ self-perception. Recent articles regarding gender disparities in surgery argue that because women tend to underrate themselves, attending physicians must recognize these tendencies and respond by entrusting female physicians with important tasks earlier and more often.2,16 Additionally, surgical teachers must help these developing female surgeons spin their narrative from one of guidance and assistance to a narrative emphasizing their own independence and capabilities. For example, faculty can use more direct encouragement while in the operating room for female residents at these levels. Faculty should also be cognizant of the words they choose when giving feedback and focus on explicitly telling female residents that they are progressing towards higher levels of entrustment and independence. If surgical educators help develop female confidence in surgery, especially at critical career transition periods, we will hopefully see this self-confidence continue as these same female surgeons ascend in their profession. Additionally, female surgeons in training must gain awareness of their tendency to underrate themselves to a greater degree than their male counterparts. With this awareness, they can continuously question whether they are adequately recognizing their autonomy and subsequently gain confidence in their independence and skill progression.
Although this study is the first to examine the gender differences in the novel EPA competency-based assessment framework piloted by the American Board of Surgery for general surgery residency, it is not without limitations. Because only five EPAs have been developed for general surgery to date, the data are limited to a small subset of the clinical situations that general surgery residents are exposed to. Additionally, these findings from a single institution may not be generalizable to other institutions because of differing workplace cultures, access to female mentorship, the gender makeup of programs at other institutions, or the way that EPAs are implemented with respect to faculty and resident development. However, our data collection extended over two full academic years, so these findings do reflect multiple classes of general surgery trainees at each PGY level. Unfortunately, while there is a central data repository for the EPA pilot trial, it only contains summative clinical competency entrustment decisions (determined every 6 months) rather than the individual EPA assessment scores and comments analyzed here. Additionally, each EPA site was free to implement EPAs in whatever manner they felt would be most effective. This led to multiple different implementation styles and software platforms, which would limit the validity of certain multi-institutional analyses. For example, many EPA sites did not collect free-text comment feedback, which would limit our ability to perform a multi-institutional NLP analysis. Future work will involve collaboration with other EPA sites that performed similar EPA implementations to improve the generalizability of these findings. These collaborations will provide a larger pool of residents, enabling extension of this work with more granular analyses of resident-level data in addition to the aggregate assessment-level data presented in this work.
CONCLUSION:
Faculty EPA assessments showed no differences in EPA entrustment levels of male and female residents, while resident self-assessments showed that female residents were rating themselves lower than their male counterparts by nearly an entire PGY level. This tendency was also reflected in the narrative feedback of PGY1 and PGY4 female residents who emphasized their need for supervision, rather than the autonomy they displayed. This study demonstrates that EPAs are an effective form of competency-based assessment that can help to limit gender bias from an outside faculty evaluator when implemented with appropriate faculty development.
However, gender differences in self-perception persisted despite this new type of assessment. Surgical educators must gain awareness of the tendency of female residents to underestimate their abilities – especially at key transitions during training such as the beginning of surgical residency (PGY1) and the transition to chief resident roles (PGY4). As the EPA assessment framework is rolled out broadly across institutions, these data provide key observations that can be used to tailor faculty feedback and to empower residents to recognize their progressive independence.
Grant Support:
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number T32 CA090217. This content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
REFERENCES
- 1. 2018 ACS Governors Survey: Gender inequality and harassment remain a challenge in surgery. The Bulletin. Published September 1, 2019. Accessed February 8, 2020. https://bulletin.facs.org/2019/09/2018-acs-governors-survey-gender-inequality-and-harassment-remain-a-challenge-in-surgery/
- 2. Stephens EH, Heisler CA, Temkin SM, Miller P. The Current Status of Women in Surgery: How to Affect the Future. JAMA Surg. Published online July 8, 2020. doi: 10.1001/jamasurg.2020.0312
- 3. Bates C, Gordon L, Travis E, et al. Striving for Gender Equity in Academic Medicine Careers: A Call to Action. Acad Med J Assoc Am Med Coll. 2016;91(8):1050–1052. doi: 10.1097/ACM.0000000000001283
- 4. Gerull KM, Loe M, Seiler K, McAllister J, Salles A. Assessing gender bias in qualitative evaluations of surgical residents. Am J Surg. 2019;217(2):306–313. doi: 10.1016/j.amjsurg.2018.09.029
- 5. Klein R, Julian KA, Snyder ED, et al. Gender Bias in Resident Assessment in Graduate Medical Education: Review of the Literature. J Gen Intern Med. 2019;34(5):712–719. doi: 10.1007/s11606-019-04884-0
- 6. Fassiotto M, Li J, Maldonado Y, Kothary N. Female Surgeons as Counter Stereotype: The Impact of Gender Perceptions on Trainee Evaluations of Physician Faculty. J Surg Educ. 2018;75(5):1140–1148. doi: 10.1016/j.jsurg.2018.01.011
- 7. Thompson-Burdine J, Sutzko DC, Nikolian VC, et al. Impact of a resident’s sex on intraoperative entrustment of surgery trainees. Surgery. 2018;164(3):583–588. doi: 10.1016/j.surg.2018.05.014
- 8. Fonseca AL, Reddy V, Longo WE, Udelsman R, Gusberg RJ. Operative confidence of graduating surgery residents: a training challenge in a changing environment. Am J Surg. 2014;207(5):797–805. doi: 10.1016/j.amjsurg.2013.09.033
- 9. Stahl CC, Collins E, Jung SA, et al. Implementation of Entrustable Professional Activities into a General Surgery Residency. J Surg Educ. 2020;77(4):739–748. doi: 10.1016/j.jsurg.2020.01.012
- 10. Brasel KJ, Klingensmith ME, Englander R, et al. Entrustable Professional Activities in General Surgery: Development and Implementation. J Surg Educ. 2019;76(5):1174–1186. doi: 10.1016/j.jsurg.2019.04.003
- 11. Norman G. Likert scales, levels of measurement and the “laws” of statistics. Adv Health Sci Educ. 2010;15(5):625–632. doi: 10.1007/s10459-010-9222-y
- 12. Lumley T, Diehr P, Emerson S, Chen L. The Importance of the Normality Assumption in Large Public Health Data Sets. Annu Rev Public Health. 2002;23(1):151–169. doi: 10.1146/annurev.publhealth.23.100901.140546
- 13. Fox J, Weisberg S. An R Companion to Applied Regression. 3rd ed. Sage; 2019. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
- 14. Jelodar H, Wang Y, Yuan C, et al. Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey. arXiv:1711.04305 [cs]. Published online December 5, 2018. Accessed June 26, 2020. http://arxiv.org/abs/1711.04305
- 15. Grün B, Hornik K. topicmodels: An R Package for Fitting Topic Models. J Stat Softw. 2011;40(1):1–30. doi: 10.18637/jss.v040.i13
- 16. Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. J Mach Learn Res. 2003;3(Jan):993–1022.
- 17. Stahl CC, Jung SA, Rosser AA, et al. Natural language processing and entrustable professional activity text feedback in surgery: A machine learning model of resident autonomy. Am J Surg. Published online November 26, 2020. doi: 10.1016/j.amjsurg.2020.11.044
- 18. Robinson D, Silge J. 6 Topic Modeling | Text Mining with R. Accessed July 7, 2020. https://www.tidytextmining.com/
- 19. Butkus R, Serchen J, Moyer DV, Bornstein SS, Hingle ST. Achieving Gender Equity in Physician Compensation and Career Advancement: A Position Paper of the American College of Physicians. Ann Intern Med. 2018;168(10):721–723. doi: 10.7326/M17-3438
- 20. Greenberg CC. Association for Academic Surgery presidential address: sticky floors and glass ceilings. J Surg Res. 2017;219:ix–xviii. doi: 10.1016/j.jss.2017.09.006
- 21. Meyerson SL, Odell DD, Zwischenberger JB, et al. The effect of gender on operative autonomy in general surgery residents. Surgery. 2019;166(5):738–743. doi: 10.1016/j.surg.2019.06.006
- 22. Babchenko O, Gast K. Should We Train Female and Male Residents Slightly Differently? JAMA Surg. Published online March 4, 2020. doi: 10.1001/jamasurg.2019.5887
- 23. Kay K, Shipman C. The Confidence Gap. The Atlantic. Accessed July 14, 2020. https://www.theatlantic.com/magazine/archive/2014/05/the-confidence-gap/359815/
