Journal of General Internal Medicine
. 2024 Jan 30;39(10):1795–1802. doi: 10.1007/s11606-024-08645-6

Gender Differences in Work-Based Assessment Scores and Narrative Comments After Direct Observation

Janae K Heath 1, Jennifer R Kogan 1, Eric S Holmboe 2, Lisa Conforti 2, Yoon Soo Park 3, C Jessica Dine 1
PMCID: PMC11282012  PMID: 38289461

Abstract

Background

While some prior studies of work-based assessment (WBA) numeric ratings have not shown gender differences, they have been unable to account for the true performance of the resident or explore narrative differences by gender.

Objective

To explore gender differences in WBA ratings as well as narrative comments (when scripted performance was known).

Design

Secondary analysis of WBAs obtained from a randomized controlled trial of a longitudinal rater training intervention in 2018–2019. Participating faculty (n = 77) observed standardized resident–patient encounters and subsequently completed rater assessment forms (RAFs).

Subjects

Participating faculty in longitudinal rater training.

Main Measures

Gender differences in mean entrustment ratings (4-point scale) were assessed with multivariable regression (adjusted for scripted performance, rater and resident demographics, and the interaction between study arm and time period [pre- versus post-intervention]). Using pre-specified natural language processing categories (masculine, feminine, agentic, and communal words), we used multivariable linear regression to determine associations of word use in the narrative comments with resident gender, race, and skill level, faculty demographics, and the interaction between study arm and time period (pre- versus post-intervention).

Key Results

Across 1527 RAFs, there were significant differences in entrustment ratings between women and men standardized residents (2.29 versus 2.54, respectively, p < 0.001) after correction for resident skill level. Compared to men, feminine terms were more common in comments about what the resident did poorly for women residents (β = 0.45, CI 0.12–0.78, p = 0.01). This difference persisted despite adjusting for the faculty's entrustment ratings. There were no other significant linguistic differences by gender.

Conclusions

In contrast to prior studies, we found entrustment rating differences in a simulated WBA setting that persisted after adjusting for the resident's scripted performance. There were also linguistic differences by gender after adjusting for entrustment ratings: feminine terms were used more frequently in some, but not all, narrative comments about women.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11606-024-08645-6.

KEY WORDS: assessment, gender bias, work-based assessment, medical education

INTRODUCTION

Gender-based differences have repeatedly been demonstrated in faculty assessments of medical trainees.1–11 These differences have been observed in quantitative measurements,4,5 where women residents receive lower competency ratings despite no differences in performance. Similarly, there are gender-based linguistic differences in qualitative assessments, such as end-of-rotation narrative evaluations2,3 and letters of recommendation.12 Given the paramount importance of evaluation data for residency competency decisions, identifying assessment options that provide unbiased data is critical.

One proposed strategy to counter these gender-based assessment differences is to improve specificity through the use of behavior-based assessments.13 Workplace-based assessments (WBAs), which are intended to assess what physicians are doing in practice,14 represent a key component of competency-based medical education assessment and offer these advantages.15–17 WBAs generally include an entrustment rating scale and a space for written comments. WBA narrative assessments can yield high-quality, behaviorally specific feedback and are therefore considered a potential option in efforts to mitigate gender-based assessment differences.

Importantly, in contrast to the multiple studies showing gender-based differences in end-of-rotation evaluations, summative assessments, and recommendation letters, recent studies focusing on numeric entrustment ratings have not shown similar gender differences.18 While these results support the hypothesis that use of WBAs may mitigate gendered assessment differences, they raise the question of whether this work missed potential gender bias. Specifically, this prior work implies that equal numeric ratings across genders represent an absence of bias. Alternatively, as that work was unable to link ratings to actual clinical performance, the absence of rating differences could instead indicate a stringency bias against higher-performing women trainees. In the study by Weber et al., the women were noted to have more nominations to medical honor societies, so they might have been expected to perform better overall, all else being equal.18 It is therefore surprising that these individuals were rated similarly to their male counterparts. Additional studies have also demonstrated higher performance metrics for women physicians,19,20 suggesting that equal ratings between genders would be unexpected.

Given this uncertainty, reassessing for gender bias in entrustment ratings where the learner's performance level is known is an important next step. In addition, as prior work has identified linguistic differences in evaluations even in the absence of rating differences,3,21 it is also important to evaluate whether gender bias is present in the WBA narratives. WBA narratives may reveal unique areas of bias not observed in the ratings, as has been seen in other narrative assessments. To the best of our knowledge, no research has explored the presence of gender bias in the narrative assessments of WBAs associated with entrustment ratings.

Therefore, the objectives of this study were the following: (1) to evaluate for entrustment rating differences across gender when the resident performance level was known (within a controlled setting); and (2) to assess for gender differences in the narrative comments of WBAs.

METHODS

This study is a secondary analysis of narrative WBAs obtained from a previously published randomized controlled trial of a longitudinal rater training intervention conducted between October 2018 and May 2019. Full details of the recruitment and rater training have been described elsewhere.22,23 The University of Pennsylvania Institutional Review Board approved this study.

Study Setting and Participants

Between December 2017 and August 2018, Family Medicine and Internal Medicine physicians were recruited to participate, spanning 138 programs across six Midwest states and 186 programs across five Mid-Atlantic states. Eligible participants were generalist teaching faculty who (1) were responsible for outpatient clinical training and resident assessment, (2) provided outpatient care for their own patient panel, and (3) held a faculty position for at least 1 year.

In the original trial, 94 eligible faculty who agreed to participate were randomized to either the intervention (WBA rater training) or control group. Participants (n = 77) who completed a self-administered demographic web-based questionnaire and video assessments are included in the current analysis.

Pre-intervention WBAs were obtained between October and November 2018. Participants watched ten videos of a standardized resident taking a history from, or counseling, a standardized patient, drawn from a library of 18 videos (nine history-taking, nine counseling); development and validation of the videos have been described previously.22 In each video, the standardized resident was scripted to one of three predetermined skill levels set by the investigators and an expert panel, and each video had a unique script between the standardized resident and patient. Details of the standardized residents, including gender, performance level, and task, are included in Supplemental Table 1. The standardized encounters included eight women residents (44%), scripted at skill levels 1 (n = 3), 2 (n = 3), and 3 (n = 2), and ten men residents (56%), scripted at skill levels 1 (n = 4), 2 (n = 4), and 3 (n = 2). Each participant was shown videos in one of three random orders.

After watching the standardized resident–patient encounter, participants assessed each resident using an online rater assessment form (RAF) that included five free-text questions: (1) what the resident did well; (2) what positive feedback should be prioritized; (3) what deficiencies or errors were made; (4) what corrective feedback should be prioritized; and (5) a brief overall summary of the resident's performance. Participants were then asked to assign a prospective supervision entrustment rating, ranging from observer only to unsupervised practice, to identify the level of supervision the resident would require in the future (see Supplemental Table 2).

Study Intervention (Rater Training)

Intervention group participants attended two in-person, 3-h rater training workshops that included performance dimension training and frame-of-reference training, the details of which have been previously published.22 This was followed by three asynchronous online spaced-learning frame-of-reference training modules (every 6 weeks, starting 4 weeks after the in-person workshops). The rater training did not specifically address gender bias but was designed to help faculty observers base their assessments on specific, evidence-based behaviors for medical interviewing and counseling. The control group did not receive any rater training.

Post-intervention Assessments by Raters

Between March and May 2019, participants watched and rated 10 additional videos (five history-taking and five counseling videos) in one of three randomly assigned orders using the RAF.

Data Analysis

The dataset included faculty demographics (gender, specialty, age, experience, and role), standardized resident demographics (gender, race, post-graduate year, skill level), the skill assessed (history-taking or counseling), and the faculty’s narrative comments and entrustment ratings. The dataset also included whether the faculty was in the intervention or control group and whether the RAF was part of the pre- or post-intervention time period.

We used descriptive statistics to summarize demographic questions and t-tests to assess baseline differences between the intervention and control groups. We analyzed the entrustment ratings as a continuous variable to assess for differences between resident genders using mixed-effects linear regression (adjusting for the resident's scripted skill level, resident PGY level, faculty age, and faculty gender). We included faculty gender in the model given prior studies showing an interaction between rater gender and ratings.24,25 This model was used intentionally to account for the non-independence of raters throughout the dataset (i.e., raters have idiosyncrasies that may impact assessments and narrative comments).26 The model provides adjusted standard errors accounting for rater clustering (random effects) throughout the dataset. As the cohort included pre- and post-intervention data (and involvement in the intervention may have differentially impacted the outcome), the regression was also adjusted for time (pre- versus post-intervention), arm of study (intervention versus control), and the interaction between the two.
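For readers who want to see the model structure concretely, the following is a minimal sketch of this kind of mixed-effects model in Python's statsmodels (the authors' analysis was done in Stata 17.0, not this code; the file and column names, such as entrustment and rater_id, are hypothetical placeholders):

```python
# Minimal sketch (not the authors' Stata code) of a mixed-effects linear
# regression of entrustment ratings with a random intercept per rater.
# All file and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

raf = pd.read_csv("raf_data.csv")  # hypothetical file: one row per completed RAF

model = smf.mixedlm(
    # fixed effects: resident gender, scripted skill level, PGY level,
    # faculty covariates, and the time-by-arm interaction described above
    "entrustment ~ resident_gender + C(skill_level) + resident_pgy"
    " + faculty_age + faculty_gender + post_period * intervention_arm",
    data=raf,
    groups=raf["rater_id"],  # random intercept per rater handles clustering
)
result = model.fit()
print(result.summary())  # fixed-effect estimates; the random intercept
# absorbs within-rater correlation across that rater's many RAFs
```

In patsy formula syntax, post_period * intervention_arm expands to both main effects plus their interaction, matching the adjustment set described in the paragraph above.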

Linguistic Inquiry and Word Count (LIWC-22) Analysis

We used LIWC-22 (Linguistic Inquiry and Word Count)27,28 to analyze the narrative comments. LIWC-22 has previously been used to evaluate gender bias within medical educational assessment settings,29 and allows for quantitative analysis of multiple language domains based on pre-specified word dictionaries.

The LIWC-22 dictionaries organize words, word stems, and phrases into various psychosocial constructs and ultimately quantify each construct as a single summary numeric variable. For example, LIWC-22 quantifies the tone of text as a continuous variable ranging from negative to positive emotional tone (a higher value equates with more positive tone).27,30,31 Using these LIWC-22 dictionaries, we assessed the word count and the general tone of the narrative comments.
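Although LIWC-22 itself is proprietary software, the core dictionary-scoring idea can be illustrated with a short sketch; the toy word sets below are placeholders, not the validated LIWC-22 dictionaries (which also handle word stems and phrases):

```python
# Illustrative dictionary-based scoring in the spirit of LIWC-22.
# The tiny word sets here are placeholders for the validated dictionaries.
import re

DICTIONARIES = {
    "feminine": {"cooperative", "empathetic", "polite", "considerate", "gentle"},
    "agentic": {"independent", "masterful", "smart", "ambitious", "skilled"},
}

def dictionary_scores(text: str) -> dict[str, float]:
    """Score each category as its share of total words, in percent
    (LIWC reports category scores on this kind of scale)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty comments
    return {
        category: 100 * sum(tok in words for tok in tokens) / total
        for category, words in DICTIONARIES.items()
    }

print(dictionary_scores("She was attentive, polite, and clearly skilled."))
# feminine and agentic each match 1 of 7 words (about 14.3 each)
```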

Based on prior literature showing differences in the use of agentic terms (terms that reflect an individual's ability to assert themselves, as well as competence and achievements) and communal terms (terms that highlight an individual's desire to relate and cooperate) for men and women, we assessed for differences in the use of "masculine" terms, "feminine" terms, "agentic" terms, and "communal" terms using pre-developed LIWC-22 dictionaries.12,32–34 See Table 1 for examples within each category and Supplemental Table 3 for the complete list of words within each category.

Table 1.

Sample Words in LIWC-22 Dictionaries Used in Rater Assessment Tool Free-Text Responses

Agentic terms. Example words: "Independent," "Masterful," "Smart," "Ambitious," "Skilled." Example evaluation: "The resident has mastered the cognitive information of this encounter."

Communal terms. Example words: "Compassionate," "Interpersonal," "Kind," "Selfless," "Inclusive." Example evaluation: "She made good eye contact. Sat face-to-face with patient with good body posture and language."

Masculine terms. Example words: "Confident," "Decisive," "Autonomy," "Intellectual," "Persistent." Example evaluation: "Great job applying his knowledge into education for the patient. He is really skilled."

Feminine terms. Example words: "Cooperative," "Empathetic," "Polite," "Considerate," "Gentle." Example evaluation: "She was very attentive to patient and seemed caring overall."

We performed a descriptive analysis of the LIWC-22 categories (word count, general tone of the narrative comments, and the presence of agentic terms, communal terms, masculine terms, and feminine terms). We evaluated the summary statistics of the domains by resident gender and skill level, faculty gender, and age.

We then performed a mixed-effects linear regression analysis to assess the association of the LIWC-22 categories with resident gender, adjusted for the predetermined resident skill level, faculty gender, faculty age, and the faculty assigned entrustment rating, as well as time (dichotomous coding of pre- versus post-intervention), arm of study (intervention versus control), and the interaction between the study arm and time. The model was clustered on rater (random effects).
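As with the entrustment model, a minimal sketch of this second model in statsmodels follows (again with hypothetical file and column names, and assuming the LIWC-22 domain scores have already been merged onto each RAF row):

```python
# Sketch of the narrative-comment models: each LIWC-22 domain score is
# regressed on resident gender with the adjustments described above and a
# random intercept per rater. Names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

liwc = pd.read_csv("raf_liwc_scores.csv")  # one row per RAF x question prompt

for domain in ["tone", "communal", "agentic", "feminine", "masculine"]:
    fit = smf.mixedlm(
        f"{domain} ~ resident_gender + C(skill_level) + faculty_gender"
        " + faculty_age + entrustment + post_period * intervention_arm",
        data=liwc,
        groups=liwc["rater_id"],  # random intercept per rater
    ).fit()
    # With values "man"/"woman", patsy treats "man" as the reference level,
    # so this coefficient is the adjusted difference for women residents.
    term = "resident_gender[T.woman]"
    print(domain, round(fit.params[term], 2), round(fit.pvalues[term], 3))
```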

For all analyses, statistical significance was determined using a p-value of 0.01 (using a Bonferroni correction for multiple comparisons). All statistical analyses and additional text analysis were completed using Stata 17.0 (StataCorp 2021, College Station, TX, StataCorp LP) and LIWC-22 (LIWC2022, Austin, TX, Pennebaker Conglomerates, Inc.).

RESULTS

Sample Characteristics

RAFs from the 77 participants (41 in the intervention group) who completed the demographic survey and assessed the videos were included for analysis. Table 2 summarizes participant demographics. There were no significant differences in participant demographic characteristics between the intervention and control groups.

Table 2.

Demographics of Faculty Completing Rater Assessment Forms

Variable Control (N = 36) Intervention (N = 41) p-value
Gender, n (%) 0.61
  Woman 24 (67) 25 (61)
  Man 12 (33) 16 (39)
Age in years mean (SD) 46.9 (10.2) 43.3 (9.8) 0.12
Years post-residency mean (SD) 15.5 (11.3) 11.3 (9.8) 0.08
Completed fellowship (n, %) 13 (36) 9 (22) 0.17
Primary specialty (n, %) 0.25
  Internal medicine 20 (56) 28 (68)
  Family medicine 16 (44) 13 (32)
Academic rank (n, %)
  Instructor 5 (14) 4 (10) 0.57
  Assistant professor 12 (33) 18 (44) 0.34
  Associate professor 5 (14) 10 (24) 0.25
  Professor 4 (11) 1 (2) 0.12
  Other/not applicable 10 (28) 8 (20) 0.39
Institution type (n, %)
  University-based 10 (28) 17 (41) 0.21
  Community-based, university-affiliated 14 (39) 15 (37) 0.84
  Community-based program, non-university-affiliated 12 (33) 8 (20) 0.17
  Other 0 1 (2) 0.35
Educational leadership roles (n, %)
  Program, associate, or assistant director 16 (44) 20 (49) 0.96
  Core faculty 21 (58) 21 (51) 0.53
  Clinical competency committee chair or member 12 (33) 16 (39) 0.44
  Other (resident clinic site director, assistant/associate fellowship director, medical school clinical rotation course director) 8 (22) 14 (34) 0.25
  None 5 (14) 5 (12) 0.83
Gender of standardized resident in completed RAFs (n, %) 0.96
  Standardized women resident 323 (47) 369 (53)
  Standardized man resident 397 (47) 451 (53)
Gender concordance of RAFs (n, % total RAFs within arm of study) 0.94
  Man rater – man resident 133 (18) 176 (21)
  Woman rater – woman resident 215 (30) 226 (28)
  Man rater – woman resident 108 (15) 143 (17)
  Woman rater – man resident 264 (37) 275 (34)

Thirteen RAFs were incomplete (missing baseline assessments or designated questions) and were excluded from analysis. There were 1527 RAFs in the final sample; 53% (n = 810) were completed by intervention participants. Half (n = 768) of the assessment forms were completed at baseline and half (n = 759) at follow-up. Of the RAFs in the final sample, 40% (n = 611) assessed learners at the lowest performance level, 40% (n = 612) at the mid-level, and 20% (n = 304) at the highest level. Forty-five percent of RAFs (n = 692) assessed women residents and 55% (n = 848) assessed men residents (Table 2).

Analysis of Entrustment Ratings

The prospective entrustment ratings for the full cohort spanned all options (Table 3). The gender of the rater was not associated with ratings (p = 0.41). The mean prospective entrustment rating was lower for women residents than for men (2.80 versus 2.93, p = 0.002). After adjustment for the scripted skill level of the learner, women residents were still rated lower than men (2.29 versus 2.54, p < 0.001). The assigned entrustment ratings and the univariable and multivariable analyses are shown in Table 3. This observed gender difference in assigned entrustment ratings persisted regardless of participation in rater training (there was no significant interaction between the pre-post intervention period and the control-intervention groups, p = 0.23).

Table 3.

Entrustment Ratings Assigned to Men and Women Residents on Rater Assessment Forms

Frequency of entrustment rating in men residents Frequency of entrustment rating in women residents
Recommend observation only 35 (4%) 25 (4%)
Recommend direct observation 230 (27%) 250 (36%)
Recommend indirect observation 334 (40%) 247 (36%)
Recommend unsupervised practice 243 (29%) 163 (24%)
Mean entrustment rating in men residents (mean, 95% CI) Mean entrustment rating in women residents (mean, 95% CI) p-value
Mean entrustment rating 2.93 (2.87, 2.99) 2.80 (2.74, 2.86) 0.002
Mean entrustment rating after multivariable adjustmenta 2.54 (2.28, 2.81) 2.29 (1.96, 2.63) < 0.001

aAdjusted for resident pre-specified skill level (1, 2, or 3), standardized resident PGY level, standardized resident race, faculty demographics (age, gender), time (pre- versus post-intervention), arm of study (intervention versus control), and the interaction between time and arm of study, clustered on individual faculty member

LIWC-22 Analysis of Narrative Evaluations

There were significant differences in word count by resident gender, depending on the question prompt. Narrative comments about what the resident did well were significantly longer for men than for women (52.4 versus 40.6 words; p < 0.001). There were no significant word count differences in prioritized positive feedback (36 versus 36; p = 0.88), corrective feedback (59.8 versus 63.9; p = 0.09), prioritized corrective feedback (38.3 versus 40.7; p = 0.11), or overall comments (36.8 versus 35.3; p = 0.15).

Table 4 summarizes the mixed-effects regression coefficients, showing the association between women residents and the tone of the narrative comment, as well as the use of agentic, communal, masculine, and feminine terms, for each question prompt.

Table 4.

Adjusteda Associations Between Resident Gender (Women) and LIWC-22 Domains Across WBA Narrative Assessments

Regression coefficient for women residents (β, 95% CI) p-valueb
Question 1: What the resident did well
Overall tone  − 0.44 (− 4.11, 3.23) 0.82
Use of communal words  − 0.32 (− 0.94, 0.31) 0.32
Use of agentic words  − 0.15 (− 0.60, 0.30) 0.51
Use of feminine words 0.16 (− 0.39, 0.70) 0.58
Use of masculine words  − 0.10 (− 0.21, 0.01) 0.07
Question 2: Prioritized positive feedback
Overall tone  − 2.72 (− 6.53, 1.10) 0.16
Use of communal words  − 0.37 (− 1.06, 0.32) 0.30
Use of agentic words 0.09 (− 0.44, 0.61) 0.74
Use of feminine words 0.04 (− 0.53, 0.61) 0.89
Use of masculine words  − 0.07 (− 0.21, 0.07) 0.32
Question 3: What the resident did poorly
Overall tone  − 2.15 (− 5.33, 1.04) 0.19
Use of communal words 0.41 (0.05, 0.77) 0.02
Use of agentic words  − 0.38 (− 0.86, 0.10) 0.12
Use of feminine words 0.24 (0.07, 0.41) 0.007
Use of masculine words  − 0.08 (− 0.17, − 0.01) 0.05
Question 4: Prioritized corrective feedback
Overall tone  − 1.14 (− 4.75, 2.48) 0.54
Use of communal words 0.38 (− 0.06, 0.82) 0.09
Use of agentic words  − 0.58 (− 1.21, 0.05) 0.07
Use of feminine words 0.14 (− 0.11, 0.40) 0.27
Use of masculine words  − 0.06 (− 0.19, 0.06) 0.33
Question 5: Overall summary
Overall tone  − 2.29 (− 6.22, 1.64) 0.25
Use of communal words  − 0.35 (− 0.94, 0.23) 0.24
Use of agentic words 0.09 (− 0.65, 0.82) 0.82
Use of feminine words 0.26 (− 0.06, 0.57) 0.11
Use of masculine words  − 0.11 (− 0.40, 0.19) 0.47

aAdjusted for case (focused on history-taking versus counseling), faculty demographic characteristics (including faculty age and gender), faculty assigned prospective entrustment rating, time (pre- versus post-intervention), arm of study (intervention versus control), and the interaction between time and arm of study, clustered on individual faculty member. bStatistical significance of p-value < 0.01 based on Bonferroni correction for multiple comparisons

In the narrative question about what the resident did well and the prioritized positive feedback, there were no significant differences in the overall tone, or in the use of communal terms, agentic terms, feminine terms, or masculine terms between men and women residents.

When commenting on what the resident did poorly, rater assessments of women included more frequent use of feminine terms compared to men (β = 0.24, CI 0.07–0.41, p = 0.007), including terms such as "sensitive," "responsive," and "interpersonal," after controlling for control and intervention group effects.

There were no statistically significant differences in the tone of comments or in the use of agentic, communal, masculine, or feminine terms in the prioritized corrective feedback for women. In the overall summary, there were no differences in LIWC domains between men and women residents.

DISCUSSION

We found significant differences in prospective entrustment ratings between men and women residents after standardized resident–patient encounters, with women receiving lower prospective entrustment ratings. This finding persisted despite adjustment for scripted resident performance level (pre-set by the study team and confirmed by expert raters). While this contrasts with prior work that found no entrustment rating differences by gender, ours is the first study to account for the scripted skill of the resident. It is therefore possible that the absence of entrustment rating differences by gender in prior studies reflected women being rated inappropriately low relative to their true performance, representing an ongoing source of bias. Our study suggests that WBA numeric ratings are prone to gender biases similar to those observed in other assessment domains.

In addition to differences in entrustment ratings, we found some linguistic differences in the narrative evaluations of men and women residents. We observed that feminine terms were more common in comments about women when the question focused on what the learner did poorly. These linguistic findings persisted despite adjusting for ratings (assigned by faculty), showcasing an additional and previously unrecognized source of gender bias in WBAs. This could suggest that raters have underlying and preconceived gendered expectations for communication skills and thereby disproportionately comment on communal attributes and "feminine" features as potential areas for women to improve. This raises ongoing questions about the role of societal norms and stereotypes in assessment in medical training.

One interesting finding in our analysis was the difference in word count in the free-text comments on what a resident did well, which were longer for men than for women. While there were no word count differences in the other prompts, this could suggest an implicit bias toward highlighting more positive attributes in evaluations of men. Similar to the linguistic differences, this might suggest raters disproportionately comment on positives for men in patient interactions (i.e., such positive attributes may be considered the expectation for women, but unique for men). It is unclear how the length of the positive comments might affect a learner's perception of their own skill, but one could hypothesize that these differences interact with a learner's self-efficacy and perceived skill set.

Given prior research on gender differences in narrative assessments, we were surprised by the lack of linguistic differences in the narrative comments for some of the question stems, particularly given the noted gender differences in the entrustment ratings. We hypothesize several explanations for this result. One possibility is that a prompt focused on negative feedback could be more sensitive to biased language, while prompts focused on prioritizing feedback and/or positives may be less so. Alternatively, raters may aim to focus on specific actionable areas for prioritized corrective feedback, thus using different language categories than in unprioritized corrective feedback. The feminine and communal dictionary terms convey "softer" competencies and may be perceived as less specific for feedback purposes. It is also possible that the study participants had previously undergone faculty development in gender bias mitigation, and/or that the design of the rater form had an unexpected bias-mitigating effect. Ultimately, more research evaluating the impact of the WBA question prompt on rater assessments and bias is needed.

Our study is the first to analyze gender bias in WBA numeric ratings and qualitative comments in a standardized setting. The standardized setting allowed us to account for an individual's true skill level prior to analyzing the ratings. Despite this, there are several limitations. Our study may have been impacted by selection bias, given the inclusion of a select group of motivated faculty pursuing additional training in assessment, the majority of whom were in leadership positions. While this limits generalizability, we would expect this to shift the findings toward the null (i.e., a non-select group of medical educators may have more bias than was present within our sample). Although we expect the selection bias to push the findings toward the null, it remains possible that the group was intentionally seeking additional training in evaluation and may have had increased bias at baseline. Generalizability may have been further limited by which programs were ultimately represented within the final sample of participants, although to maintain anonymity of participants, we did not have access to these data for this secondary analysis. Beyond this, we also intentionally chose to evaluate a WBA with both a numeric component and a narrative component, but the interplay between these with regard to bias is unknown and would be an important area of future work. Finally, the standardized residents were men and women, and we were unable to assess how these results would differ for individuals identifying as non-binary or gender diverse. This would be an important area of future study.

Overall, we identified gender differences in entrustment ratings in a standardized setting despite adjusting for the scripted performance of the learner, with women assigned lower prospective entrustment ratings than men. In addition, narrative comments about women more often included feminine language when indicating areas the resident did poorly (even after adjustment for the rater-assigned prospective entrustment level). This suggests that underlying gender biases and gendered expectations are prevalent in both the numeric ratings and the narrative evaluations of WBAs.

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements:

Contributors: Not applicable

Funding

Not applicable

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations:

Conflict of Interest:

The authors declare that they do not have a conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Klein R, Julian KA, Snyder ED, et al. Gender bias in resident assessment in graduate medical education: review of the literature. J Gen Intern Med. 2019;34(5):712-719.
2. Arkin N, Lai C, Kiwakyou LM, et al. What's in a word? Qualitative and quantitative analysis of leadership language in anesthesiology resident feedback. J Grad Med Educ. 2019;11(1):44-52. 10.4300/jgme-d-18-00377.1.
3. Mueller AS, Jenkins TM, Osborne M, Dayal A, O'Connor DM, Arora VM. Gender differences in attending physicians' feedback to residents: a qualitative analysis. J Grad Med Educ. 2017;9(5):577-585. 10.4300/JGME-D-17-00126.1.
4. Klein R, Ufere NN, Rao SR, et al. Association of gender with learner assessment in graduate medical education. JAMA Netw Open. 2020;3(7):e2010888. 10.1001/jamanetworkopen.2020.10888.
5. Holmboe ES, Huot SJ, Brienza RS, Hawkins RE. The association of faculty and residents' gender on faculty evaluations of internal medicine residents in 16 residencies. Acad Med. 2009;84(3):381-384.
6. Rand VE, Hudes ES, Browner WS, Wachter RM, Avins AL. Effect of evaluator and resident gender on the American Board of Internal Medicine evaluation scores. J Gen Intern Med. 1998;13(10):670-674.
7. Turrentine FE, Dreisbach CN, St Ivany AR, Hanks JB, Schroen AT. Influence of gender on surgical residency applicants' recommendation letters. J Am Coll Surg. 2019;228(4):356-365.e3. 10.1016/j.jamcollsurg.2018.12.020.
8. Li S, Fant AL, McCarthy DM, Miller D, Craig J, Kontrick A. Gender differences in language of standardized letter of evaluation narratives for emergency medicine residency applicants. AEM Educ Train. 2017;1(4):334-339. 10.1002/aet2.10057.
9. Chen S, Beck Dallaghan GL, Shaheen A. Implicit gender bias in third-year surgery clerkship MSPE narratives. J Surg Educ. 2021;78(4):1136-1143. 10.1016/j.jsurg.2020.10.011.
10. Khan S, Kirubarajan A, Shamsheri T, Clayton A, Mehta G. Gender bias in reference letters for residency and academic medicine: a systematic review. Postgrad Med J. 2023;99(1170):272-278. 10.1136/postgradmedj-2021-140045.
11. Ross DA, Boatright D, Nunez-Smith M, Jordan A, Chekroud A, Moore EZ. Differences in words used to describe racial and gender groups in medical student performance evaluations. PLoS ONE. 2017;12(8):e0181659. 10.1371/journal.pone.0181659.
12. Trix F, Psenka C. Exploring the color of glass: letters of recommendation for female and male medical faculty. Discourse Soc. 2003;14(2):191-220. 10.1177/0957926503014002277.
13. Babal JC, Webber S, Nacht CL, et al. Recognizing and mitigating gender bias in medical teaching assessments. J Grad Med Educ. 2022;14(2):139-143. 10.4300/JGME-D-21-00774.1.
14. Anderson HL, Kurtz J, West DC. Implementation and use of workplace-based assessment in clinical learning environments: a scoping review. Acad Med. 2021;96(11S):S164. 10.1097/ACM.0000000000004366.
15. Ginsburg S, Gold W, Cavalcanti RB, Kurabi B, McDonald-Blumer H. Competencies "plus": the nature of written comments on internal medicine residents' evaluation forms. Acad Med. 2011;86(10 Suppl):S30-S34. 10.1097/ACM.0b013e31822a6d92.
16. Cook DA, Kuper A, Hatala R, Ginsburg S. When assessment data are words: validity evidence for qualitative educational assessments. Acad Med. 2016;91(10):1359-1369. 10.1097/ACM.0000000000001175.
17. Young JQ, Sugarman R, Holmboe E, O'Sullivan PS. Advancing our understanding of narrative comments generated by direct observation tools: lessons from the psychopharmacotherapy-structured clinical observation. J Grad Med Educ. 2019;11(5):570-579. 10.4300/JGME-D-19-00207.1.
18. Weber D, Kinnear B, Kelleher M, et al. Effect of resident and assessor gender on entrustment-based observational assessment in an internal medicine residency program. MedEdPublish. 2021;11:2. 10.12688/mep.17410.1.
19. Tsugawa Y, Jena AB, Figueroa JF, Orav EJ, Blumenthal DM, Jha AK. Comparison of hospital mortality and readmission rates for Medicare patients treated by male vs female physicians. JAMA Intern Med. 2017;177(2):206-213.
20. Wallis CJD, Jerath A, Aminoltejari K, et al. Surgeon sex and long-term postoperative outcomes among patients undergoing common surgeries. JAMA Surg. 2023;158(11):1185-1194. 10.1001/jamasurg.2023.3744.
21. Heath JK, Weissman GE, Clancy CB, Shou H, Farrar JT, Dine CJ. Assessment of gender-based linguistic differences in physician trainee evaluations of medical faculty using automated text mining. JAMA Netw Open. 2019;2(5):e193520. 10.1001/jamanetworkopen.2019.3520.
22. Kogan JR, Dine CJ, Conforti LN, Holmboe ES. Can rater training improve the quality and accuracy of workplace-based assessment narrative comments and entrustment ratings? A randomized controlled trial. Acad Med. 2023;98(2):237-247. 10.1097/ACM.0000000000004819.
23. Calaman S, Hepps JH, Bismilla Z, et al. The creation of standard-setting videos to support faculty observations of learner performance and entrustment decisions. Acad Med. 2016;91(2):204-209. 10.1097/ACM.0000000000000853.
24. Cullen MJ, Zhou Y, Sackett PR, Mustapha T, Hane J, Culican SM. Differences in trainee evaluations of faculty by rater and ratee gender. Acad Med. 2023;98(10):1196-1203. 10.1097/ACM.0000000000005260.
25. Riese A, Rappaport L, Alverson B, Park S, Rockney RM. Clinical performance evaluations of third-year medical students and association with student and evaluator gender. Acad Med. 2017;92(6):835-840. 10.1097/ACM.0000000000001565.
26. Ginsburg S, Gingerich A, Kogan JR, Watling CJ, Eva KW. Idiosyncrasy in assessment comments: do faculty have distinct writing styles when completing in-training evaluation reports? Acad Med. 2020;95(11S):S81-S88. 10.1097/ACM.0000000000003643.
27. Boyd R, Ashokkumar A, Seraj S, Pennebaker J. The Development and Psychometric Properties of LIWC-22. Austin, TX: University of Texas at Austin; 2022. 10.13140/RG.2.2.23890.43205.
28. Tausczik YR, Pennebaker JW. The psychological meaning of words: LIWC and computerized text analysis methods. J Lang Soc Psychol. 2010;29(1):24-54. 10.1177/0261927X09351676.
29. Ginsburg S, Stroud L, Lynch M, Melvin L, Kulasegaram K. Beyond the ratings: gender effects in written comments from clinical teaching assessments. Adv Health Sci Educ Theory Pract. 2022;27(2):355-374. 10.1007/s10459-021-10088-1.
30. Cohn MA, Mehl MR, Pennebaker JW. Linguistic markers of psychological change surrounding September 11, 2001. Psychol Sci. 2004;15(10):687-693. 10.1111/j.0956-7976.2004.00741.x.
31. Monzani D, Vergani L, Pizzoli SFM, Marton G, Pravettoni G. Emotional tone, analytical thinking, and somatosensory processes of a sample of Italian tweets during the first phases of the COVID-19 pandemic: observational study. J Med Internet Res. 2021;23(10):e29820. 10.2196/29820.
32. Madera JM, Hebl MR, Martin RC. Gender and letters of recommendation for academia: agentic and communal differences. J Appl Psychol. 2009;94(6):1591-1599. 10.1037/a0016539.
33. Gaucher D, Friesen J, Kay AC. Evidence that gendered wording in job advertisements exists and sustains gender inequality. J Pers Soc Psychol. 2011;101(1):109-128. 10.1037/a0022530.
34. Pietraszkiewicz A, Formanowicz M, Gustafsson Sendén M, Boyd RL, Sikström S, Sczesny S. The big two dictionaries: capturing agency and communion in natural language. Eur J Soc Psychol. 2019;49(5):871-887. 10.1002/ejsp.2561.
