CBE Life Sciences Education. 2025 Fall;24(3):ar35. doi: 10.1187/cbe.24-07-0193

Replication of an Intervention to Mitigate Gender Bias in Student Evaluations of Teaching Yields Variable Results Across a Biology Department

Lisa D Mitchem , Rachel L Rupnow , Collin P Jaeger §, Marissa N Pezdek , Brenda K Anak Ganeng , Karen E Samonds , Heather E Bergan-Roller †,*
Editor: Rebecca Price
PMCID: PMC12415593  PMID: 40768626

Abstract

Student evaluations of teaching (SET) have repeatedly been shown to be biased against women instructors. Although few have been able to mitigate these biases, one team reported success in two courses by adding a short AntiBias statement to the beginning of SETs. We conducted a conceptual replication of that study to investigate the effectiveness of the AntiBias statement across a Department of Biological Sciences over three semesters. The AntiBias treatment inconsistently affected the SETs, sometimes improving women's scores but often having no effect. Qualitative analysis showed that the types of comments students gave were mostly not affected by the conditions of treatment or instructor gender and were most frequently framed with a positive connotation, phrased implicitly about the instructor, and focused on course characteristics such as the logistics of the course. Our findings do not support the consistent replicability of the original work scaled to the department level yet shine an important light on SETs in the biology context. Moreover, this work suggests that a simple intervention to mitigate gender bias in teaching evaluations is not sufficient to remedy the multitude of issues with SETs. We discuss differences among studies and suggestions from the literature on ways to improve the evaluation of teaching.

INTRODUCTION

Student evaluations of teaching (SET) serve several functions in higher education institutions, including providing an avenue for students to feel heard, for instructors to gain students’ perspectives on how the course went, and for administrators to learn about students’ experiences in courses. It is often assumed that the information provided on SETs is then used by instructors to make changes to the course for future terms. Unfortunately, SETs often do not reflect actual teaching quality (Jean et al., 2022; Uttl, 2024) and are subject to implicit bias against instructors from marginalized groups (Kreitzer and Sweet-Cushman, 2022). This is problematic because SET responses are commonly used as evidence of teaching quality in merit, tenure, and promotion considerations—sometimes as the only source of information about an instructor's teaching (Wachtel, 1998; Artze-Vega et al., 2023). Moreover, although many studies have examined biases in SETs across disciplines, noting that STEM (science, technology, engineering, and mathematics) disciplines often receive lower ratings compared with other disciplines (Wachtel, 1998; Rosen, 2017), few studies have examined bias specifically within biology (but see Kendall and Schussler, 2013; Peterson et al., 2019).

Recent literature reviews demonstrate evidence of biases toward instructors of certain demographics in SETs across higher education along several dimensions (e.g., Gatwiri et al., 2021; Heffernan, 2021; Kreitzer and Sweet-Cushman, 2022; Khokhlova and Lamba, 2023). The most studied instructor identity for biases in SETs is gender, typically limited to men and women. Women instructors receive lower SETs than men instructors, whether the instructor is actually a woman or only perceived to be a woman, without the quality of teaching being different (Arbuckle and Williams, 2003; MacNell et al., 2015; Boring, 2017; Fan et al., 2019; Heffernan, 2021; Hoorens et al., 2021; Adams et al., 2022; Kreitzer and Sweet-Cushman, 2022).

The presence of bias toward instructor gender has been investigated using both quantitative and qualitative approaches. Quantitative analysis is a common way of interpreting and researching SETs. Students answer prompts using Likert scale-like questions, which are often transformed into continuous data. These data can be used to inform teaching practices, rank instructors, and investigate influences on these values for evidence of biases. Several studies have used a quantitative approach to examine biases in SETs (e.g., Peterson et al., 2019; Hoorens et al., 2021; Genetin et al., 2022; Kogan et al., 2022; Sigurdardottir et al., 2022; Aragón et al., 2023; Foster, 2023). Complementarily, SETs usually include an open-response section, where students write out comments and opinions that cannot be captured with a simple scale rating. Qualitative analysis of the language that students use to describe their instructors often demonstrates biases in more nuanced ways. For example, women instructors were more often described as caring, nurturing, supportive, enthusiastic, and relatable, while men instructors were described as content experts, funny, professional, and challenging (Bachen et al., 1999; Sprague and Massoni, 2005; Adams et al., 2022; Sigurdardottir et al., 2022). Analyzing quantitative Likert-scale ratings and qualitative descriptions and connotations conjointly lends important insight into the information being passed back to instructors and administrators about students’ perceptions in the classroom.

The discipline also plays a part in how instructors are rated by students. In general, natural science-based instructors receive lower scores than instructors in the social sciences or humanities (Rosen, 2017; Heffernan, 2021), but there is variation in how instructors are rated among science fields. Certain fields of science, such as sociology and psychology, are more likely to be examined in research on biases in SETs (Arbuckle and Williams, 2003; Sprague and Massoni, 2005; MacNell et al., 2015; Boring et al., 2016; Hoorens et al., 2021), while other disciplines, such as natural sciences or STEM, are grouped into larger categories (Bachen et al., 1999; Basow, 2000; Centra and Gaubatz, 2000; Reid, 2010; Storage et al., 2016; Boring, 2017; Heffernan, 2021; Kogan et al., 2022; Kreitzer and Sweet-Cushman, 2022). In biology, most studies focus at the institutional or national level and thus investigate how biology compares with other disciplines and not nuances within the field (e.g., Terkik et al., 2016; Rosen, 2017). Only a few studies have focused on comparisons within biology. One study used a biology course to investigate SETs but pooled those data with data collected from a politics course (Peterson et al., 2019). Another study found that women biology instructors were rated as more organized and more uncertain than men instructors (Kendall and Schussler, 2013). Moreover, some have posited that the degree of gender bias may depend on the discipline (Storage et al., 2016; Fan et al., 2019). Therefore, more work needs to be done to understand the nuances of bias within disciplines, including biology.

Despite the clear need to improve SETs for their output to effectively improve teaching practices and student success (National Academies of Sciences, 2020), comparably little has been done to investigate mitigating the effects of biases in SETs. In many cases, interventions are laborious and may be out of the proximate control of the department (e.g., gender composition). Some techniques for mitigating biases against instructors of certain demographics in SETs involve incorporating student reflection (Hoorens et al., 2021; Owen et al., 2024). In a study on the effects of self-affirmation on mitigating biases against instructors based on perceived threats of bad grades, students were more critical of hypothetical male professors and rated them similarly to female professors when asked to identify their own values and personal traits (Hoorens et al., 2021). Owen et al. (2024) took a similar approach of student reflection by adding a reflection question to an experimental SET treatment. The reflection question asked students to identify their own criteria for effective teaching and then evaluate instructors based on those criteria (Owen et al., 2024). The student teaching criteria reflection within the SET had no effect on biases against female faculty (Owen et al., 2024). Strategies to reduce bias that lengthen the time needed to complete SETs can be laborious for students, which can reduce student response rates (Nulty, 2008; Cone et al., 2022). Moreover, the strategies themselves are not always effective at mitigating biases (Owen et al., 2024).

Peterson et al. (2019) proposed a simple intervention to address the issue of more laborious bias mitigation interventions in SETs. The intervention—adding a short statement at the beginning of the SET form—has been shown to mitigate gender biases in SETs (Peterson et al., 2019). The statement was structured based on evidence from social psychology for overcoming implicit biases, specifically by alerting students to this bias, motivating them to change, and giving ways to help them overcome the implicit bias. When the authors randomly assigned half of the students to view this AntiBias statement for four instructors in biology and American politics courses, the SETs with the AntiBias statement had increased scores for the women instructors, while scores of the men instructors were unaffected. The authors concluded that this simple AntiBias statement mitigated gender biases in these SETs. This study garnered national attention (Flaherty, 2019) and spurred conversation about ways to improve SETs. Furthermore, the study was replicated by others, including across a college (Genetin et al., 2022) and university (Kogan et al., 2022). The findings from these replications, however, did not show evidence of bias mitigation like the Peterson et al. (2019) study. Genetin et al. (2022) found no effect of an added implicit bias statement on SET scores and that the introduction of the statement itself discouraged SET participation from students. An implicit bias statement also had no effect on differences in SET scores between students of differing grade satisfactions (Kogan et al., 2022). Both Genetin et al. (2022) and Kogan et al. (2022) note the importance of replicating the implementation of antibias SET statements in different contexts. Genetin et al. (2022) specifically note the importance of assessing qualitative, written student feedback in future studies.

No work, to our knowledge, has been done to replicate Peterson et al.’s (2019) findings in a departmental context. Study replication is a key tenet of scientific discovery that leads to more accurate understandings of the world and is endorsed by the National Science Foundation (NSF) and the U.S. Department of Education Institute of Education Sciences (IES) (NSF and IES, 2018). We adopted the definition of replicability from The National Academies of Sciences, Engineering, and Medicine (2019) to mean investigating the same scientific question but with new data and similar methods as the previous study. Furthermore, a conceptual replication study allows us to understand the generalizability of the original findings to better understand contextual factors that may affect the effectiveness of the intervention (NSF and IES, 2018).

For this study, we investigated how a short AntiBias statement, previously shown to mitigate gender bias in SETs, affected students’ evaluations of teaching across a biological sciences department with two specific research questions (RQ):

  • RQ1

    Are there differences in how students rate courses and teaching based on an AntiBias statement and instructor gender?

  • RQ2

    How do students differ in their comments on SETs based on an AntiBias statement and instructor gender?

This work is novel and extends our understanding of how the AntiBias statement intervention affects SETs in several ways. First, this work is situated in the context of a department of biological sciences. Previous work has investigated the efficacy of the AntiBias statement in a couple of courses (Peterson et al., 2019) or its effectiveness across a college (Genetin et al., 2022) and a university (Kogan et al., 2022). The departmental context is meaningful: individual instructors often do not have control over the content of their SETs as these are often governed at the department, college, or university levels. Moreover, grassroots reforms are more likely to be first implemented at the department scale than at higher levels of the university system. Additionally, the disciplinary context of biology is important. Although the original study (Peterson et al., 2019) included two sections of one biology course, it combined data with an American politics course and therefore provides little insight into how the findings may generalize within a specific discipline. Teaching and learning are unique within disciplines (National Research Council, 2012), and therefore, controlling for the discipline is advantageous for understanding the boundaries of this intervention. Further, this work takes a mixed methods approach to understand multiple dimensions of how students report their experiences and multiple factors influencing those reports. Our study manipulates only the inclusion of the AntiBias statement in SETs, thereby testing the efficacy of the mitigation technique itself without additional interventions. Together, this work provides a rich and nuanced understanding of how the AntiBias statement intervention affected student responses on SETs in order to help inform fair evaluation of teaching.

MATERIALS AND METHODS

This work was conducted with approval from the institutional review board at Northern Illinois University (#HS21-0348).

Study Context

This study was conducted in a Department of Biological Sciences at a large 4-y, doctoral-granting, regional comprehensive university in the Midwestern United States during the terms of Spring 2021, Fall 2021, and Spring 2022. Courses included in this study varied in their target populations, modality, course size, level, and majors served (Table 1). The sample included undergraduate-only, combined undergraduate/graduate, and graduate-only courses. Course modalities were face-to-face, hybrid, online asynchronous, and online synchronous. Course sizes ranged from small (<20 students) to extra-large (>100 students) (see Table 1 for additional size groupings). Course level was divided into introductory, intermediate, and upper, where introductory courses teach beginning-level biology content (100–200 level), upper courses teach advanced biology content (400+ level, including graduate courses), and intermediate courses are composed of the courses between beginning and advanced (300 level). Across semesters, upper-level courses serving undergraduate students generally made up the largest portion of the sample. Course modality was determined by the instructor; this led to online synchronous courses being the largest sample in Spring 2021 due to the COVID-19 pandemic, whereas more face-to-face classes were offered in Fall 2021 and Spring 2022 and are therefore more represented in those semesters. Small courses were the largest sample in Spring 2021; representation from all the course sizes was more even in Fall 2021 and Spring 2022. Most of these courses served primarily biology majors and minors each semester, who are described further below.

TABLE 1.

Characteristics of courses in which students completed SETs for this study.

Spring 2021 Fall 2021 Spring 2022
Course characteristics n % n % n %
Courses 24 18 19
Population Undergraduate 14 58% 10 56% 11 58%
Undergraduate/graduate 7 29% 6 33% 7 37%
Graduate 3 13% 2 11% 1 5%
Modality Face-to-face only 2 8% 15 83% 13 68%
Hybrid 1 4% 0 1 5%
Online Synchronous 13 54% 0 1 5%
Online Asynchronous 8 33% 3 17% 4 21%
Course size Extra-large (100+) 5 21% 4 22% 1 5%
Large (50–99) 3 13% 4 22% 7 37%
Medium (20–49) 5 21% 5 28% 5 26%
Small (under 20) 11 46% 5 28% 6 32%
Level Introductory 5 21% 5 28% 4 21%
Intermediate 7 29% 5 28% 6 32%
Upper 12 50% 8 44% 9 47%
Majors served Primarily Biology 19 79% 13 54% 16 67%
Sciences and Health 5 21% 4 17% 3 13%
General Education 0 1 4% 0

Note: Numbers represent the number of courses with each characteristic and the percentage of courses with that characteristic for that semester. Numbers in parentheses in course size represent the number of students enrolled.

During this time, the department served over 400 students per semester, including 378 majors and 59 students with minors in Biological Sciences (Supplemental Material S1) as well as students with other science foci such as Environmental Sciences, Clinical Laboratory Sciences, Prephysical Therapy, Prenursing, and nonscience majors. As most of the courses included in this study served primarily biology students, we describe their general demographics here and in Supplemental Material S1. There were more females than males (65–66% female); they were most commonly White non-Hispanic followed by Black Non-Hispanic and Hispanic (White, 45–49%; Black Non-Hispanic, 18–19%; Hispanic, 20–23%); they averaged 22 y of age; and about half were first-generation, transfers, or Pell-eligible. These data were collected through the institution's Office of Institutional Effectiveness as aggregate, unidentifiable information that cannot be linked to individual SETs. As such, at the time of this study, the institution only had students’ genders recorded as either “female” or “male.” We recognize that these terms are most related to sex assigned at birth and not gender, that genders are much more diverse than this binary, and that survey forms should include more inclusive options (Cooper et al., 2020). Similarly, race and ethnicities are also oversimplified and combined in the institution's data despite being separate and unique constructs. However, we kept the terms the institution gave us as this was the best data available to us and so that we did not make any further assumptions.

The instructor sample (N = 17) was mostly White, represented two genders nearly equally, ranged in rank (Table 2), and was reflective of the department's instructional population. We achieved an 85% inclusion rate of instructors, which allowed for an inclusion rate of 89% of the SETs. Only White, non-Hispanic is represented in Table 2 to protect individual instructors from being identified. We use the term “Instructor” throughout this article to describe those teaching the courses that these SETs assess regardless of their academic rank or position (e.g., tenured, tenure track, or nontenure track).

TABLE 2.

Demographics of the 17 instructors whose courses and SETs were included in these data.

Instructor identity N
Men 8
Women 9
White Non-Hispanic 14
Non-Tenure Track Instructor or Postdoc 4
Tenure Track Assistant Professor 4
Tenure Track Associate Professor 3
Tenure Track Full Professor 6

Departmental SETs

Students completed departmental SETs electronically through Qualtrics (Qualtrics, Provo, UT). The Qualtrics form (Supplemental Material S2) first asked students to identify the course they intended to evaluate and then presented an instructions page with either the Original or AntiBias statement (i.e., treatment, Table 3). At the bottom of the instructions page, students had to click a button that read, “Ok, I read and understand these instructions” in order to proceed to the next page. The following page asked students to identify the instructor of the course from a list and then give some information about themselves, including their class rank, grade point average (GPA) category, average hours worked per week on that course, and expected grade in that course. Students were not asked for demographic information per university policy. On the next page, students were given five statements “regarding the course lectures and exams” and asked to rate each statement on a 5-point Likert-type scale of excellent (5), good (4), average (3), below average (2), and poor (1). On the next page, students were given another five statements “regarding the course instructor,” including “My overall rating of the instructor is:” and again were asked to rate each statement on the 5-point Likert-type scale. On the last page, students were given a final closed-response question, “My overall rating of this course is:” to rate on the same 5-point Likert-type scale and three open-response questions of “Please indicate, with regards to the ONLINE implementation of this class (if applicable), what you particularly liked, disliked or what can be improved,” “Please indicate what you particularly like or dislike about this course, aside from the online aspect of this course (if applicable),” and “What would you suggest to improve this course?” The first open-ended question regarding the online implementation of the course would only be answered by students in online or hybrid courses, whereas students in face-to-face, hybrid, or online courses could provide responses to the second and third open-ended questions. The question wording and order were the same across the semesters of this study. Students completed these SETs anonymously per university policy.

TABLE 3.

The two treatment statements used in this study. Students randomly received one of the statements on the second page of their SET in Qualtrics.

Original statement (74 words): “University policy requires that a student evaluation of instructors be given in this course. Results are considered by the department personnel committee in the process of determining merit ratings. This affects serious decisions concerning salary, promotion, and tenure. Your instructor will also consider the results in their future course preparation. Please complete this evaluation in a thoughtful manner. All responses are completely confidential, and will be helpful in determining future implementation of this course.”

AntiBias statement (144 words): “Student evaluations of teaching play an important role in the review of faculty. Your opinions influence the review of instructors that takes place every year. Northern Illinois University recognizes that student evaluations of teaching are often influenced by students’ unconscious and unintentional biases about the race and gender of the instructor. Women and instructors of color are systematically rated lower in their teaching evaluations than white men, even when there are no actual differences in the instruction or in what students have learned. As you fill out the course evaluation please keep this in mind and make an effort to resist stereotypes about professors. Focus on your opinions about the content of the course (the assignments, the textbook, and the in-class material) and not unrelated matters (the instructor's appearance). All responses are completely confidential, and will be helpful in determining future implementation of this course.”

The SETs were disseminated to students in the last 2 wk of the course in-person and/or through the learning management system. For in-person dissemination, a graduate teaching assistant unrelated to the course provided students with a link to the Qualtrics SET form for students to fill out during ∼10 min of class time while the instructor was out of the room. For all courses, regardless of modality, instructors were emailed the link to the SET and an example prompt to post to the learning management system that encouraged the students to complete the SET.

Study Design

This study used a 2 × 2 design to investigate the effects of altering the introductory statement in SETs when considering the gender of the course instructor. We use only “men” and “women” as categories for instructor gender because those were the only genders the instructors self-reported to us (Table 2; described further in the Data Sources section below). Students were randomly assigned to either receive the original SET (Original treatment) or the same evaluation with the short AntiBias statement created by Peterson et al. (2019) (AntiBias treatment, Table 3). The Original treatment was taken from the department's preexisting SET and was unchanged for this experiment. The AntiBias treatment differed only in the replacement of the original SET statement with an AntiBias statement (Table 3). Randomization was achieved using the Randomizer function in Qualtrics. We use “condition” herein to describe the context of the SETs in terms of both the instructors’ gender and treatment (Original or AntiBias statements).

Furthermore, we used concurrent triangulation mixed methods (Warfa, 2016), with primacy to the quantitative, to investigate multiple ways that conditions could have affected students’ responses on SETs. Specifically, the quantitative analysis addressed whether there were differences among the conditions of treatment and instructor gender. The qualitative analysis addressed how students commented on teaching and courses in SETs.

Data Sources

SETs were the source of the dependent variables for all analyses. Only SETs pertaining to courses taught by instructors for whom we had a gender identification were included in this research. We collected instructors’ genders by asking instructors to self-identify and consent to being included in this research (Supplemental Material S3). We also aligned this information with publicly available records of instructors’ genders such as pronouns listed on faculty websites. Only the genders of “man” and “woman” were reported by the instructors, despite being provided with more options, including an open response; therefore, we use only these two genders herein, although we recognize that gender is much more diverse than this binary. Course characteristics (summarized in Table 1) were collected through the course schedules posted on the Institutional Registration and Records website. Demographics of students served by the Department of Biological Sciences (Supplemental Material S1) were collected through the Office of Institutional Effectiveness as aggregate, unidentifiable information.

Analysis

SETs were analyzed quantitatively and qualitatively. Only submitted SETs with fully completed closed-response questions in which students correctly matched the course and instructor were included. We did not test the department's SET for evidence of validity or reliability because it is an established tool that had been used by the department for over 30 years, and the aim of this research was to evaluate whether the previously implemented treatment (i.e., AntiBias statement) might impact SETs for instructors of different genders when applied at the department scale without other changes.

Quantitative Analysis on Closed-Ended Questions

Quantitative analysis was conducted on all semesters of the data: Spring 2021 (n = 186 SETs), Fall 2021 (n = 405), Spring 2022 (n = 236). Student responses to closed-ended questions on the 5-point Likert-type scale were converted to numeric values of excellent (5), good (4), average (3), below average (2), and poor (1). Analysis was conducted on (1) the overall average score of all questions per SET, (2) responses to the individual question “My overall rating of the instructor is:,” and (3) responses to the individual question “My overall rating of this course is:.” This selection is reflective of the questions and sets analyzed in Peterson et al. (2019). Additionally, we grouped questions based on their phrasing. The first set of questions was “regarding the course lectures and exams,” which in this department is the responsibility of the instructor; therefore, we labeled these questions as “Implicit” as they are implicitly about the instructor. Conversely, the next set of five questions was explicitly “regarding the course instructor” and thus labeled “Explicit.” We then analyzed all questions separately to determine the driving questions for any differences within the “Implicit” and “Explicit” labeled question groups.
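
As a minimal sketch of this scoring step in R (the column names q1–q11 and the data frame name sets are ours for illustration, not the department's actual export format; we assume q1–q5 are the “lectures and exams” items, q6–q10 the “course instructor” items with q10 the overall instructor rating, and q11 the overall course rating), the conversion and aggregation might look like:

```r
# Sketch only: convert Likert labels to numbers and build per-SET outcome
# variables. Column names (q1-q11) and the data frame "sets" are hypothetical.
likert_to_num <- c("poor" = 1, "below average" = 2, "average" = 3,
                   "good" = 4, "excellent" = 5)

item_cols <- paste0("q", 1:11)
sets[item_cols] <- lapply(sets[item_cols],
                          function(x) unname(likert_to_num[tolower(x)]))

sets$overall_avg       <- rowMeans(sets[paste0("q", 1:11)])  # (1) average of all items
sets$implicit_avg      <- rowMeans(sets[paste0("q", 1:5)])   # "lectures and exams" items
sets$explicit_avg      <- rowMeans(sets[paste0("q", 6:10)])  # "course instructor" items
sets$instructor_rating <- sets$q10                           # (2) overall instructor rating
sets$course_rating     <- sets$q11                           # (3) overall course rating
```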

Due to differences in course characteristics among semesters (Table 1), we assessed differences among conditions separately for each semester. We used linear models to assess differences among conditions for the outcome variables described above: overall average, instructor rating, course rating, implicit questions, and explicit questions. The predictor variables were the treatment (Original or AntiBias statement) and instructor gender (Man or Woman). Because the effect of the treatment may depend on the gender of the instructor, as seen in Peterson et al. (2019), we tested for an interaction between these variables. Models were analyzed using the car package (Fox et al., 2023) in R version 4.4.0 (R Core Team, 2024). Standard effect sizes are not applicable to models with a significant interaction (Maxwell et al., 2018) and are therefore not reported in those cases. When a significant interaction was present, we conducted post hoc pairwise comparisons using estimated marginal means with the emmeans package (Lenth et al., 2020). When there was a significant main effect but no interaction, we calculated the standardized effect size of Cohen's d (Maher et al., 2013) using the effsize package (Torchiano, 2022).
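
A sketch of this modeling workflow in R, under the assumption of a data frame sets with an outcome column (e.g., overall_avg) and factor columns treatment and gender (the variable names are ours, not the authors’), might be:

```r
library(car)      # Anova() for the model tests
library(emmeans)  # estimated marginal means for post hoc pairwise comparisons
library(effsize)  # Cohen's d for main effects when there is no interaction

# Fit the 2 x 2 linear model with a treatment-by-gender interaction,
# run separately on each semester's data.
fit <- lm(overall_avg ~ treatment * gender, data = sets)
Anova(fit)

# If the interaction is significant: pairwise comparisons among the four conditions.
emmeans(fit, pairwise ~ treatment * gender)

# If only a main effect (e.g., gender) is significant: standardized effect size.
cohen.d(overall_avg ~ gender, data = sets)
```

The same three steps would be repeated for each outcome variable (instructor rating, course rating, implicit and explicit question averages).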

Furthermore, to test whether the random assignment of treatments produced balanced student samples, we compared frequencies of student self-reported information (i.e., class rank, GPA category, average hours worked per week on that course, and expected grade in that course) between the treatments using χ2 analyses in R version 4.4.0 (R Core Team, 2024).
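
A sketch of this randomization check, again with hypothetical column names, could loop over the self-reported characteristics and test each one against treatment:

```r
# Sketch only: for each self-reported characteristic, test whether its
# distribution differs between the Original and AntiBias statements.
balance_vars <- c("class_rank", "gpa_category", "hours_worked", "expected_grade")
lapply(balance_vars, function(v) chisq.test(table(sets$treatment, sets[[v]])))
```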

Qualitative Analysis on Open-Ended Questions

Qualitative analysis was conducted on students’ open responses to the following prompts on data from Spring 2021: “Please indicate, with regards to the ONLINE implementation of this class (if applicable), what you particularly liked, disliked or what can be improved,” “Please indicate what you particularly like or dislike about this course, aside from the online aspect of this course (if applicable),” and “What would you suggest to improve this course?” The unit of analysis was the student—we analyzed students’ responses to all questions together because we did not want to double count students making the same point in response to different questions. We coded all responses in which students provided codable, substantive responses to the open-ended questions; therefore, responses of “n/a,” “nothing,” “everything,” and similar were removed. This resulted in 180 individual student responses, which differs slightly from the 186 SETs with completed closed-response questions used for quantitative analysis in the Spring 2021 semester, as some students completed only the closed-response questions and left the open-response questions blank.

Three coders analyzed each student response iteratively following advice by Anfara et al. (2002). The coders included a biology education research student, a biology education researcher, and a math education researcher, which provided a variety of perspectives on student responses, though we acknowledge our coding ultimately conveys the researchers’ interpretations of students’ responses. The math education researcher, who was not able to easily associate names of people within the Department of Biological Sciences, replaced all instructor names in the data with pseudonyms (starting with W for women instructors and starting with M for men instructors) before coding began. Hereafter, any instructor names cited in student quotes will be referred to by their designated pseudonym.

The three coders first tested a subset of 50 student responses to create the codebook. The biology education research student and biology education researcher examined 25 student responses each and aligned their codes to the math education researcher who examined all 50 student responses. The initial iteration of coding focused on categorizing course and instructor characteristics; however, coders found it challenging to disentangle which factors were attributed to the course versus the instructor in the student responses.

Over the course of six versions of the codebook, the coders concluded that a three-part coding scheme could best capture nuances in the data. The three-part coding included (1) explicit versus implicit statements, (2) course and instructor characteristics, and (3) connotation. Explicit versus implicit codes identified the instructor's agency with respect to the statement. Course characteristics included the codes of logistics, course quality, and online experience; instructor characteristics included instructor's teaching and instructor's caring. The coders identified statements as having positive, negative, or neutral connotations. An example of a single three-part code for phrasing-aspect-connotation would be “Explicit-Logistics-Positive” if a student explicitly spoke about the instructor (explicit) setting up a logistics-related factor (logistics) and spoke of this in a positive way (positive). The subcodes are defined and exemplified in the Results. Each student response could receive multiple three-part codes. The student response quotations in the Results are unedited unless denoted by brackets (e.g., to blind the institution), though they may be shortened to permit discussion of one coded segment at a time. Once the codebook was finalized, each student response was independently analyzed by two coders, and codes for each student response were reviewed by all three coders before reaching consensus.

Frequencies of Qualitative Codes

We analyzed the frequency of qualitative codes among the treatment and instructor gender conditions to assess how students differed in their comments among treatments. We utilize our assessment of code frequencies as a secondary, explanatory analysis to further elucidate the results from our quantitative findings (Warfa, 2016; Czocher and Melhuish, 2024). We assessed differences in overall code count frequencies among treatment and instructor gender conditions using χ2 tests. We did not statistically assess differences in three-part code counts among the treatment and instructor gender conditions due to small sample sizes (see sample sizes in Supplemental Material S6); however, we used χ2 tests to assess differences among conditions for individual code components (i.e., explicit, implicit, positive, negative, etc.). A sketch of the overall code count comparison is shown below.
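
For the overall code counts, this amounts to a χ2 test on a 2 × 2 table of treatment by instructor gender; a sketch with made-up counts (for illustration of the test's shape only, not the study's data) follows.

```r
# Sketch only: the counts below are invented for illustration and are not
# the actual code frequencies observed in this study.
code_counts <- matrix(c(105, 115,    # men:   AntiBias, Original
                        125, 110),   # women: AntiBias, Original
                      nrow = 2, byrow = TRUE,
                      dimnames = list(gender = c("man", "woman"),
                                      treatment = c("AntiBias", "Original")))
chisq.test(code_counts)
```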

Positionality Statement

We acknowledge that this work was informed by our positionalities, including our disciplinary expertise, gender and racial identities, and departmental contexts. Specifically, the team includes two students, one postdoc, and four faculty with content expertise in biology, discipline-based education research, and math. Our methodological expertise includes quantitative and qualitative approaches. Our gender identities include woman, man, and nonbinary. Six of the authors identify as White, and one author identifies as Southeast Asian. All authors have a connection to the departmental context of the study: three authors are members of that department, one author did their graduate work in the department and was previously employed there, one author was previously an undergraduate student and currently a graduate student in the department, one author was previously a graduate student in the department, and another author is in a different department within the same college. We aimed to leverage our own intersectional positionalities to understand these data while promoting open and thorough communication among the research team throughout the project.

RESULTS

Balance of Conditions

We evaluated the balance of SET student response distribution among the treatment (Original or AntiBias statement) and instructor gender (Man or Woman) conditions to test the randomization process for SETs (Supplemental Material S4). The number of student responses did not vary by condition for all the semesters combined, χ2 (1, N = 827) = 0.51, p = 0.48, or per semester, p > 0.05. Furthermore, for each semester, we compared students’ self-reported characteristics of class rank, GPA category, average hours worked per week on that course, and expected grade in that course between the two treatments (Original or AntiBias statement) using χ2 tests. There was no difference between the number of SETs per treatment based on student rank, GPA category, hours worked on course category, or expected grade category (Table 4). Therefore, the randomization appeared successful.

TABLE 4.

The χ2 test results assessing the balance of student self-reported characteristic distributions between the Original and AntiBias treatments for each semester.

Spring 2021 Fall 2021 Spring 2022
Student Characteristics χ2 df p χ2 df p χ2 df p
Class Rank 2.31 4 0.6 1.38 4 0.85 6.29 4 0.18
GPA 3.1 4 0.54 2.02 4 0.73 0.65 4 0.96
Hours worked on course 4.18 4 0.43 5.55 4 0.24 3.86 4 0.43
Expected grade 2.03 3 0.57 0.87 4 0.93 0.84 3 0.84

Quantitative Findings from Closed-Ended Questions

Overall Average Ratings

The AntiBias treatment had an inconsistent impact on students’ closed-response SETs. In the Spring 2021 semester, there was no significant effect of the AntiBias treatment, instructor gender, or their interaction (Supplemental Material S5). In the Fall 2021 semester, there was a significant interaction between treatment and instructor gender for the overall average rating [F(1,401) = 4.96, p = 0.03]. Students gave the highest scores when they received the AntiBias treatment and were reporting on courses with women instructors in the Fall 2021 term (Figure 1). Within the AntiBias treatment, women instructors were scored significantly higher than men instructors (Supplemental Material S5). In the Spring 2022 semester, there was a significant effect of instructor gender on SET average overall ratings [F(1,232) = 3.80, p = 0.05]. Students gave higher scores to women instructors in this semester (Figure 1).

FIGURE 1.

Overall averages of SETs separated by treatment, instructor gender, and semester. For each condition, plots show the mean ± one SE (black) with individual data points in faded yellow (Original statement) and green (AntiBias statement). Letters denote statistically significant differences among conditions.

Instructor Explicit Question Average Ratings

In the Fall 2021 term, there was a significant interaction between treatment and instructor gender for the average rating of instructor explicit questions [F(1, 401) = 6.14, p = 0.01]. Students gave the highest scores when they received the AntiBias treatment and were reporting on courses with women instructors (Figure 2; Supplemental Material S5). Within the AntiBias treatment, women instructors were scored significantly higher than men instructors (Supplemental Material S5). For questions explicitly about the instructor, differences among treatment and instructor gender in Fall 2021 were driven by the questions “The instructor's availability to students outside of class was:” [F(1,401) = 7.82, p < 0.01], “The instructor's encouragement of students to ask questions and the clarity of response to questions was:” [F(1,401) = 5.32, p = 0.02], and “The instructor's verbal and written presentation and ability to organize and explain clearly the basic concepts of this course was:” [F(1,401) = 5.20, p = 0.02] (Supplemental Material S5).

FIGURE 2.

Average student responses to questions explicitly about their instructor separated by treatment, instructor gender, and semester. For each condition, plots show the mean ± one SE (black) with individual data points in faded yellow (Original statement) and green (AntiBias statement). Letters denote statistically significant differences among conditions.

For Spring 2021 and Spring 2022, there was no significant interaction between the AntiBias treatment and instructor gender for the average rating of instructor explicit questions. However, women instructors were rated higher than men for Spring 2021 [F(1,182) = 9.28, p < 0.01, Cohen's d = 0.44] and Spring 2022 [F(1,232) = 6.41, p = 0.01, Cohen's d = 0.38] (Figure 2). These differences between instructor genders for questions phrased explicitly about the instructor were driven by the following questions: “The instructor's ability to provide an atmosphere of interest and enthusiasm in the subject was:” (Spring 2021: [F(1,182) = 10.95, p < 0.01]; Spring 2022: [F(1,232) = 6.72, p = 0.01]), “The instructor's availability to students outside of class was:” (Spring 2021: [F(1,182) = 6.41, p = 0.01]), “The instructor's encouragement of students to ask questions and the clarity of response to questions was:” (Spring 2021: [F(1,182) = 10.31, p < 0.01]; Spring 2022: [F(1,232) = 7.61, p = 0.01]), and “The instructor's verbal and written presentation and ability to organize and explain clearly the basic concepts of this course was:” (Spring 2022: [F(1,232) = 5.62, p = 0.02]) (Supplemental Material S5).

Instructor Implicit Question Average Ratings

There were no significant interactions or main effects on questions phrased implicitly about the instructor across all three semesters (Supplemental Material S5).

Instructor Overall Rating

In the Fall 2021 term, there was a significant interaction between the AntiBias treatment and instructor gender for instructor overall rating [F(1,401) = 5.49, p = 0.02]; however, there was no interaction or treatment effect in the Spring 2021 or Spring 2022 semesters. Within the AntiBias treatment in Fall 2021, women instructors were scored significantly higher than men instructors (Figure 3, Supplemental Material S5). In the Spring 2021 semester, students gave women instructors higher instructor ratings compared with men [F(1,182) = 6.29, p = 0.01, Cohen's d = 0.36] (Figure 3).

FIGURE 3.

Student responses to “My overall rating of the instructor is…” separated by treatment (Original statement in yellow, AntiBias statement in green), instructor gender, and semester. Letters denote statistically significant differences between instructor gender.

Course Overall Rating

Women instructors had a higher course rating (Figure 4) during two of the three semesters: Fall 2021 [F(1, 401) = 5.21, p = 0.02, Cohen's d = 0.22] and Spring 2022 [F(1,232) = 4.93, p = 0.03, Cohen's d = 0.35].

FIGURE 4.

Student responses to “My overall rating of the course is…” separated by treatment (Original statement in yellow, AntiBias statement in green), instructor gender, and semester. Letters denote statistically significant differences between instructor gender.

Qualitative Findings: Themes from Open-Ended Questions

In this section, we characterize students’ responses qualitatively. We provide representative examples of each code in Tables 5, 6, and 7, related to students’ phrasing of comments, emphases on course or instructor characteristics, and the connotation of each comment, respectively. We then expand on the range of responses aligned with each code before examining frequencies of codes in the next section. For clarity, we italicize the names of codes in text.

TABLE 5.

Phrasing codes, highlighting students’ emphasis on the actor in comments.

Code Code Definition Example
Instructor Explicit Student remarks about what the instructor did in the class, including how well they taught, entertainment value of instruction, and ways the instructor did or did not display empathy. “I really liked the topics that were covered within this class, and I thought that Dr. Willa did a really good job at teaching this class.”
Instructor Implicit Student remarks on aspects of the course that could be perceived as related to the instructor's choices but are not clearly linked by the student's response (e.g., length of exam, course modality, course content); includes seemingly addressing the instructor indirectly. “The lecture materials were interesting to learn. Lectures and labs sometimes felt overwhelming because I was not sure which information to focus on and study, so it felt like there was a lot more detail to study for than was perhaps needed for the exams. Apart from that, I enjoyed the real-world connections that were made between the lecture material and clinical cases.”
Other Student remarks on aspects of the course that clearly fall outside the instructor's purview (e.g., building construction, specifically calling out the institution). “Purchase better voice recording equipment for the instructors (microphones)”

TABLE 6.

Instructor and course characteristics codes, definitions, and examples.

Category Code Code Definition Example
Course Characteristics Logistics Student comments on course set-up, modality, homework, and/or exam features. “I liked that you could check the lectures whenever you want, but I did not like the timing for the exams, it was not enough.”
Course quality Student expresses appreciation (or lack thereof) for the material in the course (not instruction) and utility (or lack thereof) of the material to future career plans and learning. “I love the application of this course to the real world.  I feel as though I learned a lot of things that will go a long way in pursuit of my career in biology.”
Online general Unspecified comments on nature of online course format. “Online made group discussions hard at times”
Instructor Characteristics Instructor's teaching Student comments on instructor's in-class instruction (or appropriate analogue), including clarity of lecture, instructor's enthusiasm, choice of in-class instructional decisions, and clarity of expectations. “I enjoyed the video lectures my professor did since they helped me understand the material a lot more than the book or presentation slides alone.”
Instructor's caring Student comments on instructor's demonstrations of understanding students’ situation and attention to students’ needs (or not). “I really like how the professor took off the last two [Perusall assignments]. it showed she understood how difficult this semester has been for the students and i really appreciate it.”
Other Other Instructor/course characteristic unclear but other aspects of response clear. “Nothing [to suggest changing] Wanda has been the BEST TEACHER I HAVE EVER HAD.”

TABLE 7.

Connotation of comment codes, definitions, and examples.

Code Code Definition Example
Positive Student frames remarks about the instructor, course, or other characteristic in a positive way. “I liked how our professor posted the lecture videos in advance so we had more time in the week to get a head start on watching them and taking notes.”
Negative Student frames remarks about the instructor, course, or other characteristic in a negative way. “tests were difficult study guides not very helpful.”
Neutral Student frames remarks about the instructor, course, or other characteristic in both positive and negative ways or connotation is unclear. “The weekly schedule” [full response to “Please indicate what you particularly like or dislike about this course, aside from the online aspect of this course (if applicable).”]

Explicit and Implicit

Phrasing codes characterized how students framed their comments—whether a particular comment seemed to acknowledge the instructor as the one with agency over the aspect or whether it seemed to be perceived as something that the instructor did not control or decide. Instructor explicit comments sometimes directly mentioned the instructor's name (as in Table 5’s example) but did not have to. They could refer to the “teacher,” “instructor,” or “professor,” such as: “I enjoyed the video lectures my professor did since they helped me understand the material a lot more than the book or presentation slides alone.” Some used pronouns, such as: “her structure of the class was very unreliable; her exams are ridiculously hard, I did not enjoy her class.” Instances where the student referred to the instructor (by name, pronoun, or as professor/instructor/teacher) were viewed as explicit because they credited some activity directly to the instructor. Cases where we viewed the most sensible interpretation as for the instructor to be engaged in the activity were also coded as explicit, such as: “I enjoyed the one on one virtual meetings to discuss your progress and thoughts about the class. Some ideas and suggestions about how the class could be improved were sometimes discussed during these personal meetings.” Because the sensible person to be engaged in these meetings was the instructor and a meeting requires two or more people, we viewed this as recognition that the instructor had agency over the activity in question.

What we termed instructor implicit comments focused on parts of courses that were likely influenced, if not completely controlled, by the instructor, but that the students did not seem to attribute to the instructor's choices. These comments often included passive voice and focused on course aspects without reference to people (as in the example in Table 5, where the lectures, labs, and clinical materials are seemingly permanent fixtures of the course without reference to an instructor). Similarly, some examples had no subject of the sentence, such as “Wish that some of the lectures were recorded or there were live lectures where we could ask more questions” but the context could implicitly be related to the instructor's choices. Responses to the third open-response question (i.e., “What would you suggest to improve this course?”), focused on suggestions to improve the course that seemed directed to the instructor but not by name were still considered implicit, such as: “Less questions on the exams so I actually have time to read and comprehend them.” In such cases, it seemed that the student was giving feedback to the instructor, but was trying to do so indirectly, so we still viewed the comment as implicit.

Student responses that did not seem related to the instructor's choices in any way were considered other phrasing. For instance, this student directly noted their critique was at the institutional level: “The software available to [institution] for online classes is not very good or intuitive (blackboard, office 360, etc.). This unfortunately spills over into every single class, including this one. If anything were to be improved, it would be the software that is used for these online classes.” A similar exhortation to the institution seems to be the focus of the example in Table 5. Other instances seemed focused at the course level but also ignored the instructor, such as, “The online implementation of this course was good. The audio quality was good, and there were never any issues with connection. I don't believe that the in-person aspect would have provided anything more valuable. To me, it was the same experience, and it was a good one.” Here, the focus of the comment seems to be on the modality and the technology rather than on the instruction or instructor.

Instructor and Course Characteristics

Instructor and course characteristics codes were assigned to characterize what the students were commenting on—aspects related to the course or aspects related to the instructor's presentation of the course. Course characteristics included logistics, course quality, and online general. We assigned the code logistics to aspects focused on timing, modality, and syllabus information, such as the example in Table 6 and the following: “Also, the professor does not accept late work; I turned in an assignment 9 min late and received no credit even though I spent hours working on it. I understand that deadlines are deadlines, however, I am pretty sure no one is grading homework at 11:59 pm.” Note, this example credits the instructor with the late work policy, while the example in Table 6 is written more implicitly about lecture and exam choices. We coded course quality to characterize an emphasis on the course material as might be described in the course catalogue, as well as whether that course seemed useful for their major or future career. The example in Table 6 emphasized the applicability of the course and utility for their future career. In contrast, the following student's response highlighted a lack of expected applicability: “It covers a broad range of topics, but unfortunately, the type of analysis I need to do for my thesis wasn't covered (GLMMs) [generalized linear mixed model]. It was a required class so I had to take it to graduate, but I was also expecting to learn the skills I need for my thesis.” Responses coded online general largely focused on limitations or affordances of being in an online setting without clearly linking the online environment to other characteristics. The example in Table 6 emphasizes the difficulty in communication that was caused by the online environment but seems focused on the nature of being online more than logistics, instruction, or other components of teaching. In contrast, this student's response emphasizes the positive organizational aspect of being online and having all course materials online, implicitly, rather than having to keep track of notes, textbooks, and other learning materials in different formats: “Like many of my other online classes, having everything in one spot is pretty nice.”

Instructor characteristics included instructor's teaching and instructor's caring. Instructor's teaching included characterizations of what the instructor would do during class or videos they made for asynchronous learning. For instance, the example in Table 6 showcases the utility of lecture in helping the student understand the material. In contrast, the following example seemed to view their instructor's teaching as difficult to learn from: “I wish the instructor didn't just read from the slides and used wording that could be used for an audience of non-science people. It makes it harder to understand the concepts when the instructor just reads from the slides and doesn't provide an “easier” base to jump from. I feel as though the lecture should complement the book in a way that is not on the same level of difficulty, but easier to understand.” Comments focused on engagement were also viewed as part of teaching, such as “I loved how engaged we were in class and lab.” Instructor's caring related to comments focused on students’ social or emotional needs being met or not. For instance, “Dr. Wren was very kind and accommodating to students.” focuses on the student perceiving the instructor as generally considerate and the example in Table 6 linked an instructor's reduction of workload in the semester to their perception that students have lives outside class. Perceptions of a lack of caring seemed to stem from students’ feedback not being incorporated as desired, such as: “Despite our many suggestions to Dr. Wren, she seemed very unwilling to make changes or stick with making changes. If ever a change was made after our many complaints, she would make the change once and then would never follow through with doing it again.” Note, the same instructor received both positive and negative instructor's caring codes.

We coded other characteristics when other portions of the response were clear, but the precise characteristics were unclear. The example in Table 6 seemed to view something about the course/their instructor in a positive way, but what characteristics of the course were appreciated were unclear. Similarly, this student conveyed a neutral stance toward the course as a whole: “Course was ok.”

Connotations

Connotations codes attended to students’ language when characterizing whether the aspect being described was positive, negative, or neutral for the student (Table 7). Positive connotation codes were used when students wrote that they “liked” something (as in the example in Table 7), characterized something as the “best” (as in the other characteristic example in Table 6), or indicated enjoyment or appreciation (e.g., “I find the content fascinating and would love to focus more on R and how to apply what we've learned to our own research.”). In contrast, negative connotation codes were used when students expressed dislike over unhelpful materials (e.g., example in Table 7), detailed suggestions for improvement (e.g., “One area for improvement would be to have the exam questions more closely match the material we learn in class, and discuss the textbook more since the exams seem to more closely follow the book.”), and declarations of dissatisfaction (e.g., “This is not a class that should be taught online. Unfortunately, this was the only option due to the circumstances.”). Comments were considered neutral if the connotation was deliberately neutral, unclear, or included both positive and negative aspects. For instance, we viewed “Course was ok” to be a deliberately neutral statement. We were uncertain how to interpret the connotation of “the weekly schedule,” as listed in Table 7. Comments that listed both liked and disliked parts of the course content (course quality) also received a neutral code, such as: “Particularly liked the endocrine system unit. Slightly disappointed the immune system unit was left out.” Similarly, because the questions were structured as asking for ways to improve the course, some students provided suggestions of ways to improve despite having glowing remarks in response to the prior questions, such as the following:

[Response to “Please indicate, with regards to the ONLINE implementation of this class (if applicable), what you particularly liked, disliked or what can be improved.”] “I liked that the instructor gave weekly announcements in regards to material or what was needed to prepare for upcoming assignments. The teacher often responded to emails or questions in a timely manner, so communication was good. I also like that we were able to see our grades within a week of taking exams, so it was easier to know where my grade is and how to improve. The study guides were extremely helpful for the exams and followed through with the questions given on tests. The teacher also offered office hours to meet and discuss any questions or concerns. There were also several extra credit opportunities as a bonus, which I think is great!” [Response to “Please indicate what you particularly like or dislike about this course, aside from the online aspect of this course (if applicable).”] I think this course was done well, it was easy to follow and there was good communication. [Response to “What would you suggest to improve this course?”] “The only suggestion I may add, if possible, is to add some quizzes or other assignments to the course, since there are only 5 grades in total, which are all exams.”
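To illustrate how these dimensions combine, the following is a minimal R sketch (with entirely hypothetical example rows, not our data) showing how a single coded excerpt carries one phrasing, one aspect, and one connotation code, and how those combinations can be tallied into the three-part code frequencies analyzed in the next section.

    # Hypothetical coded responses: one row per coded excerpt, with one code
    # from each of the three dimensions (values are illustrative only).
    coded <- data.frame(
      phrasing    = c("Explicit", "Implicit", "Implicit"),
      aspect      = c("Teaching", "Logistics", "Caring"),
      connotation = c("Positive", "Negative", "Positive")
    )

    # Combine the three dimensions into three-part codes such as
    # "Explicit-Teaching-Positive" and tally their frequencies.
    coded$three_part <- paste(coded$phrasing, coded$aspect, coded$connotation,
                              sep = "-")
    table(coded$three_part)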

Findings from Frequencies of Qualitative Codes Among Conditions

To complement the previous section, which characterizes students' responses qualitatively, we analyzed how students' comments differed among treatments. Overall, students described instructors in more ways (according to the number of codes) on SETs with women instructors and the AntiBias treatment, χ2 (1, N = 459) = 5.0318, p = 0.02. Women received 240 three-part codes overall compared with 219 for men in the Spring 2021 semester (Supplemental Material S6). These trends were consistent within nearly all three-part codes, with the exceptions of “Explicit-Caring-Positive” (Men = 12 total, Women = 9 total), “Implicit-Teaching-Positive” (Men = 12 total, Women = 8 total), and “Implicit-Logistics-Negative” (Men = 40 total, Women = 38 total). No statistical analyses were performed on individual three-part codes due to small sample sizes.
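The contingency table behind this overall test is not reported directly; one plausible reconstruction, assumed here for illustration only, uses the per-condition code totals implied by the phrasing rows of Table 8 (men: 103 AntiBias, 116 Original; women: 138 AntiBias, 102 Original), which reproduce the reported statistic when the test is run without continuity correction. A minimal R sketch under that assumption:

    # Hypothetical reconstruction: total codes per condition, summed from the
    # phrasing rows of Table 8 (an assumption, not the authors' reported table).
    codes <- matrix(c(103, 138,   # AntiBias column: men, women
                      116, 102),  # Original column: men, women
                    nrow = 2,
                    dimnames = list(gender = c("man", "woman"),
                                    treatment = c("AntiBias", "Original")))

    # Pearson chi-squared test of independence without Yates correction,
    # which gives X-squared = 5.03 on 1 df, p = 0.02.
    chisq.test(codes, correct = FALSE)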

At the individual code component level (i.e., Phrasing, Aspect, and Connotation), women received more codes than men with two exceptions: the caring aspect and negative connotation codes. However, these differences across genders were not statistically significant (Table 8). SETs with men instructors and the Original statement were described more times in terms of caring than the other conditions (Table 8). Furthermore, SETs with men instructors and either treatment received slightly more negative codes than women instructors with either treatment (Table 8).

TABLE 8.

Frequencies of code components (phrasing, aspect, and connotation) across treatments and instructor gender.

                         Man                               Woman
Code             AntiBias  Original  Total      AntiBias  Original  Total      χ2 (p)
Phrasing
  Explicit           17        25      42           34        27      61       1.75 (p = 0.18)
  Implicit           77        77     154           88        66     154       1.31 (p = 0.25)
  Other               9        14      23           16         9      25       2.06 (p = 0.15)
Aspect
  Logistics          52        54     106           66        50     116       1.07 (p = 0.30)
  Online              9        11      20           12         8      20       0.40 (p = 0.53)
  Quality            14        14      28           16        14      30       0 (p = 1)
  Caring             10         9      19            5         7      12       0.05 (p = 0.82)
  Teaching           15        25      40           32        21      53       3.90 (p = 0.04)
  Other               3         3       6            7         2       9       n/a
Connotation
  Positive           51        61     112           81        60     141       3.09 (p = 0.079)
  Negative           43        44      87           42        34      76       0.34 (p = 0.56)
  Neutral             9        11      20           15         8      23       1.05 (p = 0.31)
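The component-level statistics in Table 8 are consistent with treating each row as a 2 × 2 treatment-by-gender table and applying Pearson's chi-squared test with Yates' continuity correction; this is an assumption about how the values could have been computed, not a documented procedure. A minimal R sketch using the teaching row, along with the relationship between a chi-squared statistic and its p value:

    # Hypothetical reconstruction of the teaching row of Table 8 as a 2 x 2
    # treatment-by-gender table (an assumed, illustrative calculation).
    teaching <- matrix(c(15, 32,   # AntiBias column: men, women
                         25, 21),  # Original column: men, women
                       nrow = 2,
                       dimnames = list(gender = c("man", "woman"),
                                       treatment = c("AntiBias", "Original")))

    # chisq.test() applies Yates' continuity correction to 2 x 2 tables by
    # default, giving X-squared = 3.90 on 1 df, in line with Table 8.
    chisq.test(teaching)

    # A chi-squared statistic of 3.09 on 1 df corresponds to p = 0.079,
    # the value shown for the positive-connotation row.
    pchisq(3.09, df = 1, lower.tail = FALSE)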

Regardless of treatment (Original or AntiBias) and instructor gender, the frequencies of the codes differed within the phrasing categories (explicit vs. implicit) and within the instructor/course aspect categories. Students most often gave responses using phrasing that was implicitly about the responsibilities of the instructor (i.e., the instructor implicit code), which appeared at a 2- to 3-fold higher rate than phrasing explicitly about the actions of the instructor (the instructor explicit code; Table 8). These trends were seen across all four conditions. Similar to phrasing, students consistently commented on characteristics of the course more than on instructor characteristics (Table 8), regardless of condition. Students most often wrote about the logistics of the course, in about half of the coded responses (Table 8). Instructors' teaching was the next most common code, in around a quarter of the responses. The other course characteristics of quality and online, along with the instructor characteristic of caring, were each present in around 10% of responses. Other characteristics were coded the least. The connotation of comments was most frequently positive, with half or more of the codes per condition being positive; negative comments were second, in about a third of codes, followed by neutral comments (Table 8). These trends held regardless of treatment or instructor gender.

DISCUSSION

In this study, we endeavored to examine whether an antibias statement would impact SETs for instructors of different genders at the department scale over three semesters. Our conceptual replication study aimed to test the generalizability of the original finding from Peterson et al. (2019)—that the introduction of a simple antibias statement to SETs can mitigate bias related to instructor gender—within the context of a biology department. We also expanded the quantitative study to include a qualitative perspective. We randomly assigned students to receive the Original or AntiBias statement at the beginning of the evaluations. The implementation of the AntiBias treatment yielded variable results across semesters.

Complex Contexts Limit the Success of Mitigation of Gender Biases in SETs with the AntiBias Statement

In the Fall 2021 semester, students given the AntiBias statement rated women instructors significantly higher compared with all other groups for the overall average ratings, the average rating for questions explicitly about the instructor, and the overall instructor rating. In the Spring 2021 and Spring 2022 semesters, there was no effect of the AntiBias statement on SET ratings. Trends in the ratings for explicitly framed questions were driven by questions about the instructor's ability to provide an atmosphere of interest and enthusiasm and the instructor's ability to present and explain clearly. Women instructors had statistically higher ratings compared with men instructors for both treatments in the Fall 2021 and Spring 2022 semesters, and statistically higher course ratings in the Spring 2021 semester. The implicitly framed questions showed no significant differences among conditions.

In some ways, our results align with Peterson et al. (2019) in that women instructors' scores were higher with the AntiBias statement compared with the Original statement in the Fall 2021 semester and men instructors' scores did not differ between treatments. This aligns with the direction of change (or lack of change) in the Peterson et al. (2019) paper. However, our results differ from Peterson et al. (2019) in the Spring 2021 and Spring 2022 semesters, where the AntiBias statement did not affect SET ratings. Similarly, Genetin et al. (2022) showed that, at the college level of one university, their antibias treatments did not have an effect on overall SET scores, regardless of instructor gender. Furthermore, Kogan et al. (2022) saw no effects of their antibias statements across one university; however, they did not investigate the role of instructor gender on these results. In a study that garnered national attention (Elsesser, 2024), Owen et al. (2024) tested the effects of a similar antibias statement in SETs and found that adding the statement did not affect SET ratings or mitigate biases against women instructors.

Because our work here is a conceptual replication that has intentionally expanded the study context from two courses in Peterson et al. (2019) to a whole biology department, several factors could have contributed to the variable effects seen between these studies, including course level, disciplinary context, SET instruments, and the timing of influential cultural events in the U.S. Peterson et al. (2019) focused on two introductory courses, whereas we included all levels of courses, as did Genetin et al. (2022) and Kogan et al. (2022). Some studies indicate that higher-level courses and electives with smaller class sizes tend to have higher SET ratings compared with introductory courses and courses with larger class sizes (Miles and House, 2015; Wachtel, 1998). These trends can be magnified by instructor gender. For example, in a study by Miles and House (2015), the lowest-rated courses were those with large class sizes instructed by women. Conversely, one study found no association between the level of the student and the SET scores they provide (Mengel et al., 2019). Adding complexity, departmental gender composition has also been shown to interact with instructor gender and course level in SETs. Aragón et al. (2023) found that men-dominated departments had more gender bias in SETs for women instructors in upper-level courses compared with departments at gender parity or dominated by women. In an experimental manipulation of hypothetical department composition, Aragón and colleagues (2023) also found that students' expectations of which gender taught upper-level courses depended on the department composition. Bias toward women teaching lower-level courses was removed when hypothetical department composition was at parity or skewed toward more women instructors (Aragón et al., 2023). This nuanced complexity in the literature aligns with the variability of the effects of the intervention within our study and between our study and the original Peterson et al. (2019) study.

Peterson et al. (2019) combined data from an Introductory Biology and an American Politics course, whereas we focused solely on biology courses. Genetin et al. (2022) and Kogan et al. (2022) pooled data across a college and a university, respectively, and did not investigate differences among disciplines or programs of study. Disciplinary contexts are unique and therefore may affect instructor ratings. Instructors in the natural sciences (e.g., biology) generally receive lower ratings than instructors in the social sciences (e.g., politics) (Heffernan, 2021), but this varies among studies. Rosen (2017) presents instructor ratings across ten disciplines, with biology receiving lower ratings than political science. Therefore, the differences in the disciplines included in these studies may explain some of the variability of the results.

Peterson et al. (2019) analyzed three individual questions on overall instructor rating, teaching effectiveness, and overall course rating, whereas we examined two specific prompts that aligned with theirs, the overall SET average, and average responses to questions grouped as focusing on the instructor or the course. Genetin et al. (2022) and Kogan et al. (2022) used a single question on overall instructor rating. Although these data sources overlap, they are not identical and could have introduced some variability. Relatedly, variability could be attributed to our use of the department's standing SETs instead of a new assessment with strong evidence of validity and reliability. However, this was an intentional choice. As departments across the country were adopting this statement for their SETs, we aimed to test the effectiveness of the statement without changing other factors such as the SET questions themselves.

Peterson et al. (2019) collected data from Spring 2018, whereas we collected data from Spring 2021, Fall 2021, and Spring 2022. Genetin et al. (2022) and Kogan et al. (2022) collected data from Spring 2021. Societal changes in the United States since 2019 may have had an impact on SET score differences. In the wake of social justice protests after the death of George Floyd in 2020, many institutions, including the one at which our study was conducted, provided new training programs for faculty and students intended to assist individuals in recognizing and reflecting on their biases. Additionally, the global COVID-19 pandemic impacted society in many ways, including how people interact with each other (Photopoulos et al., 2023) and what they expect from instruction (Zeng and Tingzeng Wang, 2021; Hollister et al., 2022). The COVID-19 pandemic shifted classes to mostly online for a period, including our study's Spring 2021 semester. Some have found that SETs are lower in online courses (Marzano and Allen, 2016), although our quantitative analysis did not show that. In contrast, Kifle and Kler (2023) reported higher evaluation scores during COVID-19 pandemic periods in which online teaching was the primary mode of instruction compared with pre-pandemic scores.

Unlike Peterson et al. (2019), our findings at the departmental level do not support the conclusion that the simple inclusion of an AntiBias statement mitigates the complex issue of gender bias toward instructors in SETs. They do, however, complement similar conclusions at the college and university levels.

Differences in How Students Comment on Instructors in Open-Ended SET Responses Unaffected by the Inclusion of an AntiBias Statement

Based on the qualitative analysis of Spring 2021, the AntiBias statement had no significant effect on how students wrote about instructors of different genders. We did observe that students receiving the AntiBias treatment wrote more (in terms of the number of codes), but this trend was the same for both men and women instructors. The codes most frequently focused implicitly on course characteristics, and there were marked differences in the content of the comments across instructor genders. Women instructors in our study overall received more positive comments and, specifically, over twice as many explicitly positive comments about their teaching compared with men instructors. Men instructors, however, received more comments about the caring characteristic of their teaching, both explicitly and implicitly, compared with women instructors.

Similar to our results, Owen et al. (2024) found that adding an antibias statement did not mitigate biases against women instructors. In their study, women instructors overall received less positive feedback and qualitatively different feedback than men instructors (Owen et al., 2024). Differences in positivity across instructor gender were especially prevalent for instructors who taught large classes or were harsh graders: feedback for women in these categories contained more criticism than feedback for men (Owen et al., 2024). In our study, the AntiBias treatment had no effect on quantitative SET scores or on qualitative comments for instructors of different genders. Women instructors in our study also received more qualitative comments than men instructors and, in contrast to Owen et al. (2024), received more positive comments. Owen et al. (2024) examined the effects of an antibias treatment in classes across an entire liberal arts college; we demonstrate that an antibias treatment also had no effect when applied within one university department, where faculty teaching the same discipline can be directly compared with each other.

Our results that women instructors received more positive comments and that men received more comments related to caring differ from other qualitative studies on gender differences in open-ended SET responses. Several studies found that women instructors were more often described as caring, nurturing, supportive, enthusiastic, and relatable, while men instructors were described as content experts, funny, professional, and challenging (Bachen et al., 1999; Sprague and Massoni, 2005; Adams et al., 2022; Sigurdardottir et al., 2022). Women instructors were also described in terms of physical appearance far more than men (Rosen, 2017; Mitchell and Martin, 2018; Heffernan, 2021). However, student responses in our study included few references to physical appearance, perhaps because courses were held virtually during our data collection period of the Spring 2021 semester. Additionally, in other studies on the connotation of open responses, men instructors are described with more positive comments and fewer negative comments than women instructors (Chávez and Mitchell, 2020; Sigurdardottir et al., 2022), and being described as authoritative carried a positive connotation for men but a negative one for women (Adams et al., 2022).

As stated above regarding our quantitative findings, the pandemic-related shifts in how people interact with each other and what they expect from instruction may explain why our findings differ from other studies of open-ended SET responses for instructors of different genders. We noted that Spring 2021 differed from the other two semesters, which is why we further analyzed students' open-response answers from that semester. Given that the institution was mostly online in Spring 2021 and returned to mostly in-person instruction in Fall 2021, some differences in evaluations may be attributable to a non-standard (online) course modality that required instructors to change their instructional practice, despite a similar course structure, and students to change their approach to learning. As noted above, Kifle and Kler (2023) reported higher evaluation scores during COVID-19 pandemic periods in which online teaching was the primary mode of instruction compared with pre-pandemic scores. These higher evaluation scores were attributed to organization, inspiration, and focus (Kifle and Kler, 2023). Our qualitative findings support Kifle and Kler's (2023) result that organization/logistics were an important course characteristic for students during pandemic periods. Roughly half of the codes across all four conditions related to the logistics of the course, though contrary to Kifle and Kler's (2023) findings, logistics codes were split nearly equally between positive and negative connotations (45% positive, 41% negative, and 13% neutral).

Differences in the quality of instruction provided by men and women in the department may explain our observed evaluation score differences. It is possible that the SET scores and qualitative comments were unbiased and accurately reflect differences in teaching quality between women and men instructors. Women instructors in our study overall received more positive comments and, specifically, over twice as many explicitly positive comments about their teaching compared with men instructors (Supplemental Material S6: see Explicit-Positive-Teaching 3-part codes). Alternatively, bias may exist if women instructors within the department have superior teaching quality, resulting in SET scores that show little to no difference between genders but are nonetheless biased. Assessing instructor quality in relation to SET scores is difficult because quality itself is difficult to measure and multidimensional (Marsh and Roche, 1997). In a 13-year longitudinal study, years of experience was used as a proxy for instructor quality, but instructors' SET ratings did not change over time (Marsh, 2007; Marsh and Hocevar, 1991). However, years of experience alone does not capture an instructor's abilities or quality.

As with our quantitative results, the complex contexts in which SETs are completed affect how students comment on instructors in open-ended responses. In our study, the inclusion of an antibias statement did not affect the types of comments about instructors in open-ended responses, and differences in these comments between genders were similar between treatments.

Implications

The Peterson et al. (2019) intervention was intended to improve SETs with a minor change, mitigating biases against women by adding a short statement at the beginning of the SETs; however, our work here at the department scale does not support that conclusion. Within a department, contexts like instructor gender makeup (Aragón et al., 2023), student demographics (Boring, 2017; Mengel et al., 2019; Heffernan, 2021; Kreitzer and Sweet-Cushman, 2022; Sigurdardottir et al., 2022), and class level and size (Wachtel, 1998; Miles and House, 2015) can affect how instructors are perceived. These complexities seem to break down the success of the tested mitigation strategy. Furthermore, other replications at even larger scales (Genetin et al., 2022; Kogan et al., 2022) also do not support the conclusion that a simple antibias statement can mitigate gender biases in SETs. Altogether, this highlights the variable effects of a small, simple change on the complex issue of teaching evaluation, suggesting that more holistic and arduous approaches are needed. Although our results are not consistent enough to make specific recommendations, others have reviewed the literature and reflected on their own experiences to suggest ways of making meaningful use of the imperfect tool that SETs are. For example, Artze-Vega et al. (2023) recommend focusing on trends, collaborating with others in reviewing SETs, working with students on being transparent about pedagogical choices, and incentivizing feedback from all students. Kreitzer and Sweet-Cushman (2022) also emphasize working to boost response rates. Both groups encourage rebranding SETs to center student perceptions rather than treating them as valid measures of teaching quality (Kreitzer and Sweet-Cushman, 2022; Artze-Vega et al., 2023) and describe their own personal work advocating for improvements in teaching evaluation (Artze-Vega et al., 2023). Krishnan et al. (2022) and Esarey and Valdes (2020) are proponents of multiple diverse measures of teaching, such as peer reviews and self-reflections.

Finally, work on the improvement of SETs is needed, as these evaluations are still used by administrators in decisions that affect professional appointments. Moreover, instructors value feedback about teaching but are dissatisfied with standard SETs (Brickman et al., 2016), and valuing feedback on teaching could help motivate instructors to improve their teaching (Brickman et al., 2016; Dennin et al., 2017). Until improvements can be made to the SET system, instructors and administrators reviewing SETs should keep in mind the complex nature of factors that influence SETs.

LIMITATIONS AND FUTURE DIRECTIONS

Similar to us, Peterson and colleagues (2019) did not first establish gender biases in their samples, only the effect of the antibias intervention. Although not ubiquitous (e.g., Cone et al., 2022), there is strong evidence of biases in SETs based on instructor gender (Arbuckle and Williams, 2003; MacNell et al., 2015; Boring, 2017; Fan et al., 2019; Heffernan, 2021; Hoorens et al., 2021; Adams et al., 2022; Kreitzer and Sweet-Cushman, 2022), and this was an underlying assumption in both their study and ours. We observed that women instructors received the same or higher evaluation scores overall than the men instructors, regardless of treatment. Findings from the literature indicate several factors that influence gender biases in SETs, including departmental gender parity between men and women instructors (Aragón et al., 2023), course level (Wachtel, 1998; Miles and House, 2015; Hoorens et al., 2021), perceived grade (Kogan et al., 2022), instructor age (Arbuckle and Williams, 2003; McPherson et al., 2009), and student identities, including gender (Boring, 2017; Mengel et al., 2019; Heffernan, 2021; Kreitzer and Sweet-Cushman, 2022; Sigurdardottir et al., 2022) and race (Chisadza et al., 2019; Genetin et al., 2022). Our study department had balanced proportions of men and women instructors, as represented among those included in this study, and enrolls roughly twice as many women as men students, with a diversity of races and ethnicities. These factors were not investigated here due to sample size and the protection of anonymity. Furthermore, due to small sample size, we could not randomize treatments within courses, which means we could not directly control for differences in SETs resulting from the quality of instruction. Future work with an expansive dataset could investigate these considerations together. Moreover, future work should include understudied identities such as LGBTQIA+ and disability status.

Due to the departmental context, we were also limited in other ways. For example, we were limited to investigating instructor gender only between men and women because of the demographic composition of the department and who consented to participate in the study. Additionally, students were not required or externally incentivized to fill out SETs, only requested to do so with in-class time and/or by posting in the learning management system. Furthermore, the SET used by the department had not been previously investigated for evidence of validity or reliability and therefore may not be an appropriate instrument to test the effectiveness of an intervention. Additionally, the differences in tone between the Original and AntiBias statements could have impacted SET scores. However, we intentionally aimed to test how the simple addition of the established AntiBias statement from Peterson et al. (2019) would affect a department's SETs without changing other factors such as the SET questions themselves or the incentives to complete the SETs.

Furthermore, there were logistical limitations to this work. For example, an individual student could have filled out the SET form multiple times for multiple biology classes that they were enrolled in and therefore could have been exposed to different instruction statements (i.e., AntiBias vs. Original statements) each time. The presentation of one treatment could have still been on a student's mind when completing a different SET that had the other treatment. Additionally, students were all given the same Qualtrics form to complete these SETs and asked separately to identify the course and their instructor. Therefore, students could have nefariously or erroneously completed an SET for a course they were not enrolled in; however, we removed any responses that did not correctly pair the course and instructor, and students had no incentive to use their time to complete inapplicable SETs. Finally, our qualitative analysis focuses on the Spring 2021 semester, a period where the COVID pandemic shifted most courses to an online format. Student feedback during this semester was likely, in part, impacted by the transition to new modes of teaching and learning.

CONCLUSIONS

To promote fairer SETs, we applied an AntiBias statement previously shown to mitigate gender biases against women to the broader context of a biology department and explored the results quantitatively and qualitatively. The AntiBias statement yielded variable results within our study and in comparison with the original study (Peterson et al., 2019) and two other replications (Genetin et al., 2022; Kogan et al., 2022). Sources of variability include differences in scale, instruments, and timing, but overall the results highlight that this simple mitigation strategy may not be successful beyond the original study. This emphasizes the importance of work focused on replicability and reproducibility in science. The widespread publicity of the Peterson et al. (2019) paper sparked conversations across higher education about the value of using SETs and how their results might be impacted by small changes (Flaherty, 2019; Heffernan, 2021). In response, researchers have examined the initial intervention in new settings, including at the departmental (this study), college (Genetin et al., 2022), and institutional (Kogan et al., 2022) scales. Genetin et al. (2022) and Kogan et al. (2022) did not find a significant impact of the AntiBias treatment on SET scores, whereas Peterson et al. (2019) and one of our semesters did, which showcases the inherent variability in systems of higher education across institutions and departments. Because some complex variables, such as teaching practices and student demographics, cannot be controlled for (National Academies of Sciences, Engineering, and Medicine, 2019, p. 9), it is important to examine interventions in a variety of contexts to better understand the generalizability of and causal mechanisms behind interventions.

Supplementary Material

cbe-24-ar35-s001.pdf (341.5KB, pdf)

ACKNOWLEDGMENTS

Several people were helpful in this work. We thank Nicole Scheuermann for her literature on instructor disabilities. We are thankful to the Department's Committee on Diversity, Equity, and Inclusion for encouraging and facilitating this work. We also acknowledge all the Department's faculty and instructors for engaging with this work with open minds. This work was supported in part by the NIU Division of Research and Innovation Partnerships, which had no role in the work. This work was not supported by external funding.

REFERENCES

  1. Adams, S., Bekker, S., Fan, Y., Gordon, T., Shepherd, L. J., Slavich, E., & Waters, D. (2022). Gender bias in student evaluations of teaching: Punish[ing] Those who fail to do their gender right. Higher Education, 83(4), 787–807. 10.1007/S10734-021-00704-9/METRICS [DOI] [Google Scholar]
  2. Anfara, V. A., Brown, K. M., & Mangione, T. L. (2002). Qualitative analysis on stage: Making the research process more public. Educational Researcher, 31(7), 28–38. 10.3102/0013189X031007028 [DOI] [Google Scholar]
  3. Aragón, O. R., Pietri, E. S., & Powell, B. A. (2023). Gender bias in teaching evaluations: The causal role of department gender composition. Proceedings of the National Academy of Sciences of the United States of America, 120(4), e2118466120. 10.1073/PNAS.2118466120/SUPPL_FILE/PNAS.2118466120.SAPP.PDF [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Arbuckle, J., & Williams, B. D. (2003). Students’ perceptions of expressiveness: Age and gender effects on teacher evaluations. Sex Roles, 49(9–10), 507–516. 10.1023/A:1025832707002/METRICS [DOI] [Google Scholar]
  5. Artze-Vega, I., Darby, F., Dewsbury, B., & Imad, M. (2023). The norton guide to equity-minded teaching. W.W. Norton. Retrieved from https://seagull.wwnorton.com/equityguide [Google Scholar]
  6. Bachen, C. M., McLoughlin, M. M., & Garcia, S. S. (1999). Assessing the role of gender in college students’ evaluations of faculty. Communication Education, 48(3), 193–210. 10.1080/03634529909379169 [DOI] [Google Scholar]
  7. Basow, S. A. (2000). Best and worst professors: Gender patterns in students’ choices. Sex Roles, 43(5–6), 407–417. 10.1023/A:1026655528055/METRICS [DOI] [Google Scholar]
  8. Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of Public Economics, 145, 27–41. 10.1016/J.JPUBECO.2016.11.006 [DOI] [Google Scholar]
  9. Boring, A., Ottoboni, K., Stark, P. B., & Steinem, G. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research. 10.14293/S2199-1006.1.SOR-EDU.AETBZC.V1 [DOI] [Google Scholar]
  10. Brickman, P., Gormally, C., & Martella, A. M. (2016). Making the grade: Using instructional feedback and evaluation to inspire evidence-based teaching. CBE Life Sciences Education, 15(4), ar75. 10.1187/cbe.15-12-0249 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? The Journal of Higher Education, 71(1), 17. 10.2307/2649280 [DOI] [Google Scholar]
  12. Chávez, K., & Mitchell, K. M. W. (2020). Exploring bias in student evaluations: Gender, race, and ethnicity. Political Science & Politics, 53(2), 270–274. 10.1017/S1049096519001744 [DOI] [Google Scholar]
  13. Chisadza, C., Nicholls, N., & Yitbarek, E. (2019). Race and gender biases in student evaluations of teachers. Economics Letters, 179, 66–71. 10.1016/J.ECONLET.2019.03.022 [DOI] [Google Scholar]
  14. Cone, C., Fox, L. M., Frankart, L. M., Kreys, E., Malcom, D. R., Mielczarek, M., & Lebovitz, L. (2022). A multicenter study of gender bias in student evaluations of teaching in pharmacy programs. Currents in Pharmacy Teaching and Learning, 14(9), 1085–1090. 10.1016/J.CPTL.2022.07.031 [DOI] [PubMed] [Google Scholar]
  15. Cooper, K. M., Auerbach, A. J. J., Bader, J. D., Beadles-Bohling, A. S., Brashears, J. A., Cline, E., ... Brownell, S. E. (2020). Fourteen recommendations to create a more inclusive environment for lgbtq+ individuals in academic biology. CBE Life Sciences Education, 19(3), es6. 10.1187/cbe.20-04-0062 [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Czocher, J. A., & Melhuish, K. (2024). Attending to coherence among research questions, methods, and claims in coding studies. Journal for Research in Mathematics Education, 55, 148–155. doi:10.5951/jresematheduc-2022-0037 [Google Scholar]
  17. Dennin, M., Schultz, Z. D., Feig, A., Finkelstein, N., Greenhoot, A. F., Hildreth, M., ... Miller, E. R. (2017). Aligning practice to policies: Changing the culture to recognize and reward teaching at research universities. CBE Life Sciences Education, 16(4), 1–8. 10.1187/cbe.17-02-0032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Elsesser, K. (2024). College Professors Tried To Reduce Gender Bias In Evaluations—But Couldn't. Forbes. Retrieved from https://www.forbes.com/sites/kimelsesser/2024/10/22/college-professors-tried-to-reduce-gender-bias-in-evaluations-but-couldnt/
  19. Esarey, J., & Valdes, N. (2020). Unbiased, reliable, and valid student evaluations can still be unfair. Assessment & Evaluation in Higher Education, 45(8), 1106–1120. 10.1080/02602938.2020.1724875 [DOI] [Google Scholar]
  20. Fan, Y., Shepherd, L. J., Slavich, E., Waters, D., Stone, M., Abel, R., & Johnston, E. L. (2019). Gender and cultural bias in student evaluations: Why representation matters. PLoS ONE, 14(2), e0209749. 10.1371/JOURNAL.PONE.0209749 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Flaherty, C. (2019). Teaching evals: Bias and tenure. Inside Higher Ed. Retrieved from https://www.insidehighered.com/news/2019/05/20/fighting-gender-bias-student-evaluations-teaching-and-tenures-effect-instruction [Google Scholar]
  22. Foster, M. M. (2023). Instructor name preference and student evaluations of instruction. Political Science & Politics, 56(1), 143–149. 10.1017/S1049096522001068 [DOI] [Google Scholar]
  23. Fox, J., Weisberg, S., Price, B., Adler, D., Bates, D., Baud-Bovy, G., ... R-core (2023). Package ‘car’. (3.1-2). Retrieved from https://CRAN.R-project.org/package=car [Google Scholar]
  24. Gatwiri, K., Anderson, L., & Townsend-Cross, M. (2021). ‘Teaching shouldn't feel like a combat sport’: How teaching evaluations are weaponised against minoritised academics. Race Ethnicity and Education, 27(2), 1–17. 10.1080/13613324.2021.1890560 [DOI] [Google Scholar]
  25. Genetin, B., Chen, J., Kogan, V., & Kalish, A. (2022). Mitigating implicit bias in student evaluations: A randomized intervention. Applied Economic Perspectives and Policy, 44(1), 110–128. 10.1002/AEPP.13217 [DOI] [Google Scholar]
  26. Heffernan, T. (2021). Sexism, racism, prejudice, and bias: A literature review and synthesis of research surrounding student evaluations of courses and teaching. Assessment & Evaluation in Higher Education, 47(1), 144–154. 10.1080/02602938.2021.1888075 [DOI] [Google Scholar]
  27. Hollister, B., Nair, P., Hill-Lindsay, S., & Chukoskie, L. (2022). Engagement in online learning: Student attitudes and behavior during COVID-19. Frontiers in Education, 7, 851019. 10.3389/feduc.2022.851019 [DOI] [Google Scholar]
  28. Hoorens, V., Dekkers, G., & Deschrijver, E. (2021). Gender bias in student evaluations of teaching: Students’ self-affirmation reduces the bias by lowering evaluations of male professors. Sex Roles, 84(1–2), 34–48. 10.1007/s11199-020-01148-8 [DOI] [Google Scholar]
  29. Jean, A. S., Li, Y., Ghezzi, C., & Punnett, L. (2022). Work in progress: Updating end of semester course evaluations via backwards design to reduce student bias. ASEE Annual Conference and Exposition, Conference Proceedings. 10.18260/1-2-41057 [DOI]
  30. Kendall, K. D., & Schussler, E. E. (2013). Evolving impressions: Undergraduate perceptions of graduate teaching assistants and faculty members over a semester. CBE Life Sciences Education, 12(1), 92–105. 10.1187/cbe.12-07-0110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Khokhlova, O., & Lamba, N. (2023). Evaluating student evaluations: Evidence of gender bias against women in higher education based on perceived learning and instructor personality. Frontiers in Education, 8, 1158132. 10.3389/feduc.2023.1158132 [DOI] [Google Scholar]
  32. Kifle, T., & Kler, P. (2023). Does a rose by any other name smell as sweet? Assessing student evaluation of teaching ratings pre- and during the COVID-19 lockdown: An Australian study. Elgaronline, 2(2), 179194. doi:10.4337/aee.2023.02.06 [Google Scholar]
  33. Kogan, V., Genetin, B., Chen, J., & Kalish, A. (2022). Students’ grade satisfaction influences evaluations of teaching: Evidence from individual-level data and an experimental intervention. EdWorkingPaper No, 22–513. 10.26300/spsf-tc23 [DOI] [Google Scholar]
  34. Kreitzer, R. J., & Sweet-Cushman, J. (2022). Evaluating student evaluations of teaching: A review of measurement and equity bias in SETs and recommendations for ethical reform. Journal of Academic Ethics, 20, 73–84. 10.1007/s10805-021-09400-w [DOI] [Google Scholar]
  35. Krishnan, S., Gehrtz, J., Lemons, P. P., Dolan, E. L., Brickman, P., & Andrews, T. C. (2022). Guides to advance teaching evaluation (GATEs): A resource for STEM departments planning robust and equitable evaluation practices. CBE Life Sciences Education, 21(3), ar42. 10.1187/cbe.21-08-0198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What's in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291–303. 10.1007/s10755-014-9313-4 [DOI] [Google Scholar]
  37. Maher, J. M., Markey, J. C., & Ebert-May, D. (2013). The other half of the story: Effect size analysis in quantitative research. CBE-Life Sciences Education, 12(3), 345–351. 10.1187/cbe.13-04-0082 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Marsh, H. W. (2007). Do University teachers become more effective with experience? A multilevel growth model of students’ evaluations of teaching over 13 years. Journal of Educational Psychology, 99(4), 775–790. doi:10.1037/0022-0663.99.4.775 [Google Scholar]
  39. Marsh, H. W., & Hocevar, D. (1991). Student evaluations of teaching effectiveness: The stability of mean ratings of the same teachers over a 13-year period. Teaching and Teacher Education, 7(4), 303–314. 10.1016/0742-051X(91)90001-6 [DOI] [Google Scholar]
  40. Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness effective: The critical issues of validity, bias, and utility. American Psychologist, 52(11), 1187–1197. doi: 10.1037/0003-066X.52.11.1187 [Google Scholar]
  41. Marzano, M. P., & Allen, R. (2016). Online vs. face-to-face course evaluations: Considerations for administrators and faculty. Online Journal of Distance Learning Administration, 19(4), 1–14. [Google Scholar]
  42. Maxwell, S. E., Delaney, H. D., & Kelley, K. (2018). Designing Experiments and Analyzing Data: A Model Comparison Perspective, 3rd ed. New York, NY: Routledge. [Google Scholar]
  43. McPherson, M. A., Todd Jewell, R., & Kim, M. (2009). What determines student evaluation scores? A random effects analysis of undergraduate economics classes. Eastern Economics Journal, 35(1), 37–51. Retrieved from https://www.jstor.org/stable/20642462 [Google Scholar]
  44. Mengel, F., Sauermann, J., & Zölitz, U. (2019). Gender bias in teaching evaluations. Journal of the European Economic Association, 17(2), 535–566. 10.1093/JEEA/JVX057 [DOI] [Google Scholar]
  45. Miles, P., & House, D. (2015). The tail wagging the dog: An overdue examination of student teaching evaluations. International Journal of Higher Education, 4(2), 116–126. doi: 10.5430/ijhe.v4n2p116 [Google Scholar]
  46. Mitchell, K. M. W., & Martin, J. (2018). Gender bias in student evaluations. Political Science & Politics, 51(3), 648–652. 10.1017/S104909651800001X [DOI] [Google Scholar]
  47. Lenth, R., Singmann, H., Love, J., Buerkner, P., & Herve, M. (2020). Package “emmeans.” doi: 10.1080/00031305.1980.10483031
  48. National Academies of Sciences, Engineering, and Medicine. (2020). Recognizing and Evaluating Science Teaching in Higher Education: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. 10.17226/25685 [DOI] [Google Scholar]
  49. National Academies of Sciences Engineering and Medicine. (2019). Reproducibility and Replicability in Science. Washington, DC: National Academies Press. 10.17226/25303 [DOI] [PubMed] [Google Scholar]
  50. National Research Council. (2012). Discipline-Based Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering. Washington, DC: The National Academies Press. 10.17226/13362 [DOI] [Google Scholar]
  51. National Science Foundation, & The Institute of Education Sciences U.S. Department of Education. (2018). Companion Guidelines on Replication & Reproducibility in Education Research. Retrieved from https://www.nsf.gov/pubs/2019/nsf19022/nsf19022.pdf
  52. Nulty, D. D. (2008). The adequacy of response rates to online and paper surveys: What can be done? Assessment and Evaluation in Higher Education, 33(3), 301–314. doi: 10.1080/02602930701293231 [Google Scholar]
  53. Owen, A. L., De Bruin, E., & Wu, S. (2024). Can you mitigate gender bias in student evaluations of teaching? Evaluating alternative methods of soliciting feedback. Assessment & Evaluation in Higher Education, 1–16. 10.1080/02602938.2024.2407927 [DOI] [Google Scholar]
  54. Peterson, D. A. M., Biederman, L. A., Andersen, D., Ditonto, T. M., & Roe, K. (2019). Mitigating gender bias in student evaluations of teaching. PLoS ONE, 14(5), e0216241. 10.1371/JOURNAL.PONE.0216241 [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Photopoulos, P., Tsonos, C., Stavrakas, I., & Triantis, D. (2023). Remote and in-person learning: Utility versus social experience. SN Computer Science, 4(2), 1–13. 10.1007/S42979-022-01539-6/TABLES/2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. R Core Team. (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
  57. Reid, L. D. (2010). The role of perceived race and gender in the evaluation of college teaching on RateMyProfessors.com. Journal of Diversity in Higher Education, 3(3), 137–152. 10.1037/A0019865 [DOI] [Google Scholar]
  58. Rosen, A. S. (2017). Correlations, trends and potential biases among publicly accessible web-based student evaluations of teaching: A large-scale study of RateMyProfessors.com data. Assessment & Evaluation in Higher Education, 43(1), 31–44. 10.1080/02602938.2016.1276155 [DOI] [Google Scholar]
  59. Sigurdardottir, M. S., Rafnsdottir, G. L., Jónsdóttir, A. H., & Kristofersson, D. M. (2022). Student evaluation of teaching: Gender bias in a country at the forefront of gender equality. Higher Education Research & Development, 42(4), 954–967. 10.1080/07294360.2022.2087604 [DOI] [Google Scholar]
  60. Sprague, J., & Massoni, K. (2005). Student evaluations and gendered expectations: What we can't count can hurt us. Sex Roles, 53(11–12), 779–793. 10.1007/s11199-005-8292-4 [DOI] [Google Scholar]
  61. Storage, D., Horne, Z., Cimpian, A., & Leslie, S. J. (2016). The frequency of “Brilliant” and “Genius” in teaching evaluations predicts the representation of women and African Americans across fields. PLoS ONE, 11(3), e0150194. 10.1371/JOURNAL.PONE.0150194 [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Terkik, A., Prud'hommeaux, E., Alm, C. O., Homan, C., & Franklin, S. (2016). Analyzing gender bias in student evaluations. In: Matsumoto Y. & Prasad R., eds. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (868–876). Retrieved from: https://aclanthology.org/C16-1083
  63. Torchiano, M. (2022). Package “effsize”: Efficient Effect Size Computation. Retrieved from: https://github.com/mtorchiano/effsize/
  64. Uttl, B. (2024). Student evaluation of teaching (SET): Why the emperor has no clothes and what we should do about it. Human Arenas, 7, 430–437. Retrieved from: 10.1007/s42087-023-00361-7 [DOI] [Google Scholar]
  65. Wachtel, H. K. (1998). Student Evaluation of College teaching effectiveness: A brief review. Assessment & Evaluation in Higher Education, 23(2), 191–212. 10.1080/0260293980230207 [DOI] [Google Scholar]
  66. Warfa, A.-R. M. (2016). Mixed-methods design in biology education research: Approach and uses. CBE—Life Sciences Education, 15(4), rm5. 10.1187/cbe.16-01-0022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Zeng, X., & Tingzeng Wang, S. (2021). College student satisfaction with online learning during COVID-19. International Journal of Multidisciplinary Perspectives in Higher Education, 6(1), 182–195. 10.32674/JIMPHE.V6I1.3502 [DOI] [Google Scholar]

