How Can I Help You? The Influence of Situation and Hostile Sexism on Perception of Appropriate Gender of Conversational Agents

Mathieu Pinelli; Elisa Sarda; Clémentine Bry

doi:10.5334/irsp.669

2023 Jul 12;36:10. doi: 10.5334/irsp.669


PMCID: PMC12372706 PMID: 40951799

Abstract

Conversational agents (CAs) are increasingly being developed on commercial websites nowadays. We tested in two studies whether gender stereotypes apply to non-gendered CAs. In the first study, participants evaluated whether CAs are expected to display more masculine or feminine characteristics in situations designed to be stereotypically male or female. The sexist attitudes of the respondents were also measured. As predicted, participants perceived that a CA should be more masculine in stereotypically male situations and more feminine in stereotypically female situations. Moreover, we found that hostile sexism but not benevolent sexism moderated the effect of the gendered situation. The second study replicated the results while addressing the limits of Study 1, showing the robustness of these effects. These findings are consistent with models of gender stereotypes in humans and robots and show for the first time a moderation effect of (hostile) sexism in a customer service context with CAs. The processes involved in human relationships seem relevant in a digital environment that involves CAs. Researchers and professionals should work together to avoid reproducing and perpetuating gender stereotypes when developing CAs.

Keywords: Ambivalent Sexism, Gender Biases, Conversational Agents

Introduction

Interactions between machines and humans have aroused many fantasies since the early development of computers, robots, and artificial intelligence. The claims that ‘Machines will replace humans’ or ‘we will no longer differentiate between humans and machines’ are often heard in everyday talk. Fiction stories about machines taking over humans are numerous (e.g., The Terminator, The Matrix, and Westworld, to name just a few films and TV shows).

Robots and artificial intelligence applications are increasingly being used online to help users with customer services and to simulate a realistic human presence. We focus in this paper on conversational agents (CAs) designed to interact with humans using natural language (Dale, 2016; Feine et al., 2019). Conversational agents are almost a must-have on a commercial website these days (e.g., there were 300,000 CAs on Facebook in 2018),¹ and they have positive consequences for users by increasing satisfaction and giving the feeling of a social presence (Chung et al., 2020; Feine et al., 2019). Conversational agents can be found in the form of personal assistants (e.g., Cortana, Alexa, Siri), as customer services support, in multiple technical support roles (smartphones, tablets, or computers), and in various fields, such as education, healthcare, and marketing (Bickmore & Gruber, 2010; Chung et al., 2020; Provoost et al., 2017; Tegos & Demetriadis, 2017).

Conversational agents are increasingly sophisticated and are used on a daily basis in direct contact with users in a B2C context (e.g., Chung et al., 2020). The development of CAs requires trade-offs between different technical and social features (Feine et al., 2019). One of the inevitable questions lies in relation to a possible gender for CAs, as CAs are used to increase the feeling of a human social presence and human interactions are coloured, for better or for worse, by gender and gendered behavioural expectations. Users may therefore expect gendered features for CAs; at least, developers seem to think so and have therefore produced gendered CAs.² In this paper, we ask whether people actually expect a gendered CA and which factors would trigger such gendered expectations. To our knowledge, no experimental study has previously examined gendered expectations in relation to CAs. The literature about gender features in human-human interactions and in robot-human interactions can help delineate what we can expect from CAs.

Gender in Human Interactions

Gender and its associated beliefs are central in our social relationships (Eagly & Wood, 2016; Ellemers, 2018). Men and women are believed to be similar in some ways but very different in many other ways. These beliefs influence not only our perceptions, but also our behaviour (e.g., Ellemers, 2018; Spencer et al., 2016), thus reinforcing themselves as men and women adopt gendered social roles (Eagly & Wood, 2012). These gendered social roles give the impression that they are innate and inevitable, and therefore seem to be inherent in our society (Eagly & Wood, 2016).

Gender stereotypes are both descriptive (that is, what people are) and prescriptive (that is, what people should be; Prentice & Carranza, 2002; Eagly & Karau, 2002; Ellemers, 2018). Extensive research has identified two core dimensions in social perception: Communion and Agency (or warmth and competence; see, for instance, Abele et al., 2008; Fiske et al., 2007; Judd et al., 2005). Communion is related to warmth, sympathy, emotional sensitivity, and concern with others, whereas Agency is related to competence, assertiveness, confidence, and self-control (e.g., Cuddy et al., 2008; Eagly & Karau, 2002). Social perception research has found that men are described as more agentic than women and that women are described as more communal than men (Eagly & Steffen, 1984; Ellemers, 2018). Furthermore, matching the prescription, men’s behaviour is expected to be related to competence and agency, while women’s behaviour is expected to be related to warmth and care (Prentice & Carranza, 2002). These gender norms define what traits are acceptable (or unacceptable) for men and women, and breaking the gender norms can lead to prejudice (e.g., Eagly & Karau, 2002). Gender norms define the behaviour that women and men should display and, thus, the situations that conform to each gender. Situations involving care and communality are deemed more appropriate for women, and reciprocally women are perceived as better suited for care and warmth situations. On the other hand, situations that require competence, assertiveness, and confidence are deemed more appropriate for men, and reciprocally men are perceived as better suited for competence and assertiveness situations (Eagly & Wood, 2012; Ellemers, 2018).

From the gender stereotype literature, we can infer that some people could expect an interaction agent (here a CA) to match a specific gender social role. The gender role could be cued, for instance, by the situation at hand. A situation involving warmth and care would cue a female gender role, while a situation involving competence and assertiveness would cue a male gender role. Interestingly, CAs are used in a variety of situations, with some situations being more related to warmth and care (e.g., using the guarantee attached to a hairdressing appliance) and other situations being related to competence and assertiveness (e.g., financial services allowing customers to save and invest money). Users could expect the CA to conform to a female gender role in a warmth-related situation, whereas they may expect the CA to conform to a male gender role in a competence-related situation. At least, those predictions would hold if social roles were to be applied to artificial intelligence and machines. The literature on robot-human interaction may help us understand whether there is solid ground for such hypotheses.

Gender in Human-Robot Interactions

Some studies have shown that people react to computers in the same way as they do to humans (Feine et al., 2019; Nass & Moon, 2000), and that people are able to interact with computers in the same way as they do with humans (Nass et al., 1997). The Computers Are Social Actors (CASA) model states that people interacting with computers have social reactions similar to human social interactions according to social cues like voice, gesture, physical design, or the apparent ‘gender’ (e.g., Eyssel & Hegel, 2012; Feine et al., 2019; Gong, 2008; Nass et al., 1997).

Voice is an important social cue defining personality and gender attribution. Nass et al. (1997) found that a high-pitched synthetic voice was associated with a ‘female’ computer, whereas a low-pitched synthetic voice was associated with a ‘male’ computer. Their study showed that humans react to a computer by applying the same social rules they usually reserve for social interactions between humans (see also: Nass & Moon, 2000). More recently, Eyssel and Hegel (2012) tested the effect of gendered facial features of robots on perception and description. They reported that short-haired robots (i.e., those with a male facial feature) were perceived as more agentic than long-haired robots (i.e., those with a female facial feature), which were perceived as more communal. Furthermore, male-dominated tasks (such as repairing technical equipment) were perceived as more suitable for a ‘male’ robot and, conversely, tasks dominated by women (such as household maintenance) were perceived as more suitable for a ‘female’ robot. Bernotat et al. (2021) later showed that body shape also influences the perception of a robot. Their results indicated that stereotypically female activities and communal attributions were associated with a robot with a female body shape rather than with a male body shape. Furthermore, they showed that benevolent sexism (but not hostile sexism) marginally affected the agency attribution. Correlation analysis showed that the higher benevolent sexism was, the more agency was attributed to the robot.

Therefore, gender stereotypes are applied to robots. Several studies have extended this research to CAs, showing that social features affect users’ satisfaction, but also their perceptions of truthfulness, credibility, and social presence (Araujo, 2018; McDonnell & Baxter, 2019; Toader et al., 2020; Verhagen et al., 2014). Humans can interact with CAs in natural language and adopt behaviours they usually have with their peers, including abuse, harassment, and mistreatment (Brahnam & De Angeli, 2012). Verbal abuse and sexual communication during interaction with CAs are common (De Angeli & Brahnam, 2008). For example, Brahnam and De Angeli (2012) showed that 18% of the conversation was focused on sexual attention and negative stereotypes with female CAs compared to 10% with male CAs and only 2% with non-gendered CAs.

Overall, the literature shows that people interact with CAs or robots in much the same way as they do with fellow humans. Sometimes, these interactions with CAs or robots can also exacerbate negative social processes such as gender stereotypes, harassment, or gender-based division of labour with the consequence of reproducing and reinforcing sexism daily in our society (Brahnam & De Angeli, 2012; Eyssel & Hegel, 2012; Nomura & Suzuki, 2022).

It appears that gender roles are used to interact with CAs and that gender stereotypes are applied to CAs as well as humans. Human features (e.g., a voice and/or a face) are implemented to improve the user’s experience, giving a personalized service anytime and anywhere (e.g., Chung et al., 2020), and these human features can increase inferences of social roles. However, with CAs, the interactions are generally in a written form, through a chat, which means that such human features are not relevant. There might sometimes be an avatar displaying a male or female character, but this gendered avatar is not systematically present. Therefore, most CAs could be more gender neutral than robots. Unable to rely on gendered features, will people still project gender roles on CAs? When the CA has no gender feature, is the (gendered) situation enough to trigger gender expectations toward the conversational agent? Actually, we believe that adherence to sexism could play a role.

Sexist Attitudes

Gender stereotypes have been extensively studied in human interactions, and some studies have extended that literature to robot interactions. In human interactions, the use of gender stereotypes depends on sexist attitudes. Sexism was once studied as a unitary dimension, but Glick and Fiske (1996) offered a more nuanced definition with their theory of ambivalent sexism. They proposed that two sorts of sexism coexist, as two sides of the same coin: hostile sexism and benevolent sexism. Hostile sexism matches the more traditional sexist attitudes reviewed in the literature, comprising a negative attitude towards women, with feelings of antipathy and a fear that women will take power over men (Glick & Fiske, 1996). Hostile sexism can be expressed through discrimination in employment. Studies have shown, for example, that individuals higher in hostile sexism are less likely to recommend a female candidate for a managerial position (Masser & Abrams, 2004). Benevolent sexism, on the other hand, can be seen as a ‘more positive’ attitude toward women, associated with chivalry and paternalistic attitudes (Glick & Fiske, 1996). In this form of sexism, women are perceived as having a higher moral purity than men and as too fragile to undertake tasks involving strength (protective paternalism). They are also perceived as creatures without whom men cannot be complete and as possessing qualities that men lack. Individuals higher in benevolent sexism therefore assign women to less challenging tasks (King et al., 2012), and perceive men as more agentic and women as more communal (Rudman & Kilianski, 2000). Benevolent sexism can be seen as more positive than hostile sexism, though both attitudes involve prejudice against women, placing them below men (e.g., Stermer & Burkley, 2015). For example, by describing women as warmer than men, benevolent sexism suggests that women are less competent than men (Kervyn et al., 2012).

The Current Research

Gender stereotypes infuse our social life and influence our interactions in a variety of contexts, including marketing, workplaces, and robot interactions (Bernotat et al., 2021; Grau & Zotos, 2016; Koch et al., 2015). With digital growth, the question of the influence of gender stereotypes in digital contexts involving virtual CAs is of importance. Several previous studies have focused on gender stereotypes in robots (e.g., Eyssel & Hegel, 2012), but no study has experimentally tested gender biases and sexist attitudes with CAs. We believe that there is little reason to expect that gendered CAs would not trigger gender stereotyping. However, we wondered whether neutral CAs would still be the target of sexist stereotypes and if stereotyping would be predicted by the participants’ own level of sexist attitude (i.e., hostile and benevolent sexism). We reasoned that according to the commercial service one is looking for (e.g., advice about saving money vs. finding beauty products), people could consider the situation as stereotypically masculine or feminine. Our two studies aimed to test the impact of stereotypically male and female situations on the perception of appropriate features for CAs (gender, warmth, and competence) and the moderator effect of ambivalent sexism, represented by hostile and benevolent sexism.

In this paper, we extend previous work and test whether perceptions of gender-undefined CAs are also influenced by gender stereotypes and sexist attitudes. In two studies, participants were presented with several stereotypically ‘gendered’ situations in which they had to indicate the most appropriate characteristics (i.e., gender, warmth, and competence traits) for the CA. We formulate the following hypotheses:

H1a: Participants would consider the male gender to be more appropriate for the CA in stereotypically male situations and the female gender to be more appropriate in stereotypically female situations.

H1b: Participants would deem warmth features more appropriate for the CA in stereotypically female situations and competence features more appropriate in stereotypically male situations.

H1c: The effect of stereotypically male and female situations would be moderated by sexist attitudes such that the more sexist (hostile and/or benevolent) the participant is, the more they would rely on gender stereotypes in their evaluation of the appropriate characteristics of the CAs.

Study 1

Method

Participants

A power analysis was performed using G*Power 3.1 (Faul et al., 2007) with a small to moderate effect size of f² = .10, using a within-subjects design and based on the literature on sexism (e.g., McCarty & Kelly, 2015). This power analysis suggested that we needed 114 participants for a power level of .80. Accordingly, 117 participants took part in our online study. French-speaking participants were recruited on the Prolific platform (only participants with an approval rate of 95% were included) and they received £0.84 for their participation. Fifteen participants were excluded after an initial sort,³ so the final sample included 102 participants (M_age = 30.54, SD = 10.56; 38 women and 64 men). As we did not reach the number of participants recommended by the power analysis, we performed a sensitivity analysis using G*Power to determine what effect size was detectable with the final sample at 80% power (alpha threshold of .05, 102 participants, and 20 predictors in the linear model). The analysis indicated that with this design, the minimum effect we could detect would be f² = .11.

Material and Procedure

To reduce participants’ suspicions towards the purpose and hypothesis of the study, the cover story presented the two parts as two separate studies, which were said to be combined for economic reasons. The alleged goal of the first ‘study’ was to validate questionnaires in different domains (marketing, ecology, gender perception). The participants were informed that they would randomly answer only one of three possible questionnaires. Actually, they always answered the gender perception questionnaire, which consisted of the Ambivalent Sexism Inventory (Glick & Fiske, 1996) validated in French (Dardenne et al., 2006). We used the short version of Rollero et al. (2014). The scale consists of two dimensions: hostile sexism and benevolent sexism. Both subscales are composed of six items (e.g., women seek power by having control over men; many women have a kind of purity that men do not). The participants provided a response for each item on a scale from 1 (not at all) to 6 (completely) and obtained a mean score for hostile sexism and a mean score for benevolent sexism.
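The scoring rule described above (one mean per six-item subscale, each item answered on a 1 to 6 scale) can be sketched as follows. This is only an illustration in Python; the item keys `hs1` to `hs6` and `bs1` to `bs6` and the example responses are hypothetical and do not come from the study materials.

```python
from statistics import mean

# Hypothetical item keys: hs1..hs6 = hostile sexism, bs1..bs6 = benevolent sexism.
HOSTILE_ITEMS = [f"hs{i}" for i in range(1, 7)]
BENEVOLENT_ITEMS = [f"bs{i}" for i in range(1, 7)]


def score_asi(responses):
    """Return the mean hostile and benevolent sexism scores for one participant."""
    for key in HOSTILE_ITEMS + BENEVOLENT_ITEMS:
        if not 1 <= responses[key] <= 6:
            raise ValueError(f"{key} is outside the 1-6 response scale")
    return {
        "hostile": mean(responses[k] for k in HOSTILE_ITEMS),
        "benevolent": mean(responses[k] for k in BENEVOLENT_ITEMS),
    }


# Made-up responses for a single illustrative participant.
example = {"hs1": 2, "hs2": 3, "hs3": 1, "hs4": 2, "hs5": 4, "hs6": 3,
           "bs1": 3, "bs2": 3, "bs3": 2, "bs4": 4, "bs5": 3, "bs6": 3}
scores = score_asi(example)
```

Keeping the two subscales separate, rather than averaging all twelve items, matches the use of hostile and benevolent sexism as distinct predictors.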

The participants then moved on to the alleged Study 2, presented as marketing research on the development of online CAs. A conversational agent was defined as ‘a computer program capable of conducting a conversation’, so that all participants had the same representation of a CA. The participants were told that they would be presented with different online situations in which a customer (of unspecified gender)⁴ would resort to a CA to answer their request. The participants’ task would be to indicate the CA’s most appropriate features to match the customer’s needs in each situation. Participants were instructed to answer from the customer’s point of view and not from their own, in order to limit social desirability bias (Fiske et al., 2002). Nine situations were presented in a random order to each participant (using a within-subjects design). The situations⁵ were related to online banking services and to retail websites, and were designed to conform to stereotypically male (N = 3), female (N = 3), or neutral (N = 3) gender norms.

For each situation, the participants answered a questionnaire on the CA’s appropriate features. They first evaluated the appropriate CA gender (from 1 = male to 5 = female), and its appropriate age (in its twenties, thirties, forties, or fifties). Then participants were required to rate the relevance of eight traits for the CA on a Likert scale ranging from 1 (not at all) to 5 (very much). Agency and communion traits have been used to study gender stereotypes in robots. However, agency is related to actions in the world, which is not relevant to conversational agents. We therefore chose traits related to competence and warmth instead, as they are more general (see Cuddy et al., 2008). These items were adapted from Fiske et al. (2002). Trustworthy, friendly, well-intentioned, and warm evaluated the warmth dimension, and competent, intelligent, capable, and efficient evaluated the competence dimension.

Participants then completed a post-experimental questionnaire. We measured the attitude toward CAs with four items adapted from Venkatesh et al. (2003) on a 7-point Likert scale, and one item measured the frequency of use (from 1 = Never to 5 = Very often). The five items comprised an attitude index (α = .88). The higher the score, the more positive the participant’s attitude towards CAs. We checked for suspicions regarding the actual/alleged goals of the study and the possible influence between the different parts, with three open questions. The answers were coded by the authors and rated from 0 = not suspicious to 3 = completely suspicious. Finally, a socio-demographic questionnaire collected the age, sex, socio-professional category, and nationality of each participant.

Results

Analysis Plan

Given our design, we used linear mixed-effects models, which combine fixed-effect variables (as in ANOVA) with random-effect variables. All analyses were conducted in R with the lme4 package (Bates et al., 2021).

Dependent Variables

We computed a perceived appropriate gender (1 = male to 5 = female), an appropriate warmth index (mean evaluation of warmth traits from 1 = not at all to 5 = very much), and an appropriate competence index (mean evaluation of competence traits from 1 = not at all to 5 = very much) for each scenario.

Independent Variable with Random Effects

The participants and the nine situations were variables with random effects. Therefore, the model estimated a random intercept for each, together with a random slope by situation (for participants) or by sexism level (for situations).⁶

Independent Variable with Fixed Effects: The Situations

We created two contrasts to test a linear trend from the stereotypically male situations to the stereotypically female situations through the neutral ones. We coded the first contrast C1 as female = +1, neutral = 0, male = –1, and the residual contrast C2 as female = –1, neutral = +2, male = –1. If the trend is linear, we expect C1 to be significant and C2 not to be significant.
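The logic of the two contrasts can be checked with a short sketch. The weights are those given in the text; the cell means are hypothetical values placed exactly on a line, chosen only to show that C1 then captures the whole male-to-female difference while C2 is zero. The helper names are illustrative, and the study itself was analysed in R, not with this code.

```python
# Contrast weights as given in the text.
C1 = {"male": -1, "neutral": 0, "female": 1}    # linear trend
C2 = {"male": -1, "neutral": 2, "female": -1}   # residual contrast


def dot(a, b):
    """Sum of products of two sets of contrast weights."""
    return sum(a[k] * b[k] for k in a)


def contrast_value(weights, means):
    """Weighted sum of the condition means for one contrast."""
    return sum(weights[k] * means[k] for k in weights)


# Each contrast sums to zero, and the two are orthogonal to each other.
sums_to_zero = sum(C1.values()) == 0 and sum(C2.values()) == 0
orthogonal = dot(C1, C2) == 0

# Hypothetical condition means lying exactly on a line (not data from the study).
linear_means = {"male": 2.0, "neutral": 3.0, "female": 4.0}
c1_value = contrast_value(C1, linear_means)   # female mean minus male mean
c2_value = contrast_value(C2, linear_means)   # departure of neutral from the midpoint
```

Because C2 compares the neutral condition with the average of the two gendered conditions, any value far from zero would indicate that the neutral situations do not sit midway between the male and female ones.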

Both contrasts C1 and C2, participants’ gender (–1 = woman, +1 = man), benevolent sexism (centred), hostile sexism (centred), suspicion level (centred), and attitude toward CAs (centred) were entered as fixed effects in the linear mixed-effect model (see Judd et al., 2012).
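As an illustration of how these predictors can be prepared, the sketch below effect-codes participant gender and mean-centres a continuous score. The function names and the four example participants are hypothetical. Centring means that the coefficient for C1 is estimated at the average level of the moderator rather than at a score of zero, which lies outside the 1 to 6 response scale.

```python
from statistics import mean


def centre(values):
    """Subtract the sample mean so that 0 corresponds to the average participant."""
    m = mean(values)
    return [v - m for v in values]


def code_gender(label):
    """Effect-code participant gender as in the text: -1 = woman, +1 = man."""
    return {"woman": -1, "man": 1}[label]


# Made-up values for four hypothetical participants.
hostile_sexism = [1.5, 2.0, 3.0, 3.5]
genders = ["woman", "man", "man", "woman"]

hostile_centred = centre(hostile_sexism)
gender_coded = [code_gender(g) for g in genders]
```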

All measures showed good internal consistency (see Table 1). Following Judd et al. (2012) and Judd et al. (2017), we compared models with and without each random parameter in order to retain the most conservative model. We followed the same rationale with fixed effects.⁷ The results corresponding to the tested hypotheses are presented in Table 1 (see mixed-effects models on our OSF page).

Table 1. Means (SD) and Cronbach’s alpha of variables included in the model (Study 1).

Variable	Mean (SD)	Cronbach’s alpha
Hostile sexism	2.58 (1.21)	.90
Benevolent sexism	2.88 (1.11)	.82
Appropriate Competence	4.50 (0.50)	.73
Appropriate Warmth	4.07 (0.60)	.74
Attitudes toward CAs	4.52 (1.14)	.88
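For readers unfamiliar with the internal-consistency statistic reported in Table 1, Cronbach’s alpha for k items is k/(k – 1) × (1 – sum of item variances / variance of the total score). The sketch below implements that standard formula in Python on made-up ratings; it is not the authors’ code and the data are purely illustrative.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores for the same respondents."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Total score per respondent, summing across the k items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up ratings: three items answered by four hypothetical respondents.
ratings = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(ratings)
```

Alpha approaches 1 when respondents who score high on one item also score high on the others, which is why it is read as an index of how consistently the items measure the same construct.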


The Appropriate Gender of the Conversational Agent

Suspicion level, gender of participants, attitudes towards CAs, and benevolent sexism did not have a valuable input in the model and were therefore discarded. We found a significant effect of C1, t = 3.12, p = .016, but not of C2, p = .90. As expected, we found a significant effect of stereotypical situations. We observed that the appropriate gender linearly increases toward femininity (Figure 1) when passing from stereotypically masculine situations (M = 2.67; SD = 0.76) to stereotypically feminine situations (M = 3.42; SD = 0.78).

Figure 1. Effect of the stereotypical situations on the CA’s appropriate gender (bars represent confidence intervals).

Moreover, the interaction between hostile sexism and C1 was significant, t = 3.82, p = .002, and the interaction with C2 was not, p = .90. The effect of the stereotypical situations increases with participants’ hostile sexism. The more sexist the participants are, the more they consider that the CAs’ gender should match the gendered situations (see Figure 2).

Figure 2. Conversational agent’s appropriate gender according to the gendered situations (represented by C1) and hostile sexism (centred). A lower value in the appropriate gender corresponds to a rather masculine gender, and a higher value corresponds to a rather feminine gender.

The Appropriate Level of Warmth

The suspicion level, the gender of the participants, and hostile sexism did not have a valuable input into the model and therefore were discarded. We did not find a significant effect of gendered situations on the appropriate level of warmth, C1: t < 1, p = .61; C2: t < 1, p = .36. We did not find a significant interaction with benevolent sexism. The interaction between the attitude towards CAs and C1 was significant, t = 2.91, p = .003, but not with C2, p = .66. Participants perceived warmth to be more appropriate in the female stereotypical situations than in the male stereotypical situations when they had a more positive attitude toward CAs (Figure 3).

Figure 3. Appropriate level of warmth for the conversational agent according to the gendered situations (represented by C1) and attitude towards CAs (centred).

The Appropriate Level of Competence

Suspicion level, gender of participants, attitudes towards CAs, hostile sexism and benevolent sexism were found to have no valuable input in the model, so these variables were discarded. We found a significant effect of C1 on the appropriate level of competence, t = –3.03, p = .017, but not of C2, p = .51. The results showed a linear decrease in the appropriate level of competence when moving from stereotypically masculine situations (M = 4.62; SD = 0.34) to stereotypically female situations (M = 4.46; SD = 0.40).

Discussion of Study 1

The goal of this first study was to test the effect of stereotypically gendered situations on the expected features of a neutral conversational agent, according to hostile and benevolent sexism. The results partly support our hypotheses. The appropriate gender for a neutral CA was regarded as more female in stereotypically female situations and more male in stereotypically male situations, and this effect increased with the level of hostile sexism. In addition, the competence traits were perceived as more appropriate in male situations than in female situations. Interestingly, we did not find these effects in relation to warmth. Instead, the appropriate level of warmth was predicted by participants’ attitudes towards CAs differently in stereotypically male and female situations. Specifically, the more positive the participants’ attitude toward CAs, the more they perceived warmth as appropriate in stereotypically female situations compared to stereotypically male situations. This effect was not expected and needs replication.

In this study we did not control the customer’s gender and used a within-subjects design: Participants were exposed to all nine situations. The within-subjects design may have increased the participants’ awareness of our hypotheses related to gender stereotypes. Furthermore, the customer’s gender being unspecified, the gendered situations may have influenced not only the perceptions of the CA but also the perceptions of the customer. Participants may have inferred that the customer is a woman in stereotypically female situations and a man in stereotypically male situations. This inference could have influenced participants through unexpected processes. Research shows that people prefer CAs that look like them and have a similar gender (ter Stal et al., 2020; Bailenson et al., 2008). Hence, men would prefer masculine CAs and women would prefer feminine CAs. To decrease social desirability, however, we asked participants to take the perspective of an average customer (not their own perspective), and interestingly we found no effect of the participants’ gender. Nevertheless, in order to meet the requirements of the task (i.e., rate the appropriate level of traits to increase the customer satisfaction), participants could have answered based on two uncontrolled inferences: that the customer is a woman (a man) in female (male) situations, and that a female (male) customer would prefer a female (male) agent to match the customer gender. Since we want to ascertain that the gendered situations influence the perception of the agent and that answers are not related to the customer gender, we decided to manipulate the customer gender in Study 2.

Thus, we conducted a second study, with a larger sample, in which we controlled the gender of the customer and used a between-subjects design to minimize any awareness about our hypotheses, by limiting the number of situations presented.

Study 2

In this study, we aimed to replicate the results of Study 1 and to overcome its limitations. We presented only one gendered situation type (male, neutral, or female) using a between-subjects design, and we presented the customer as either a man or a woman. We formulate the following hypotheses:

H2a: Participants would consider the male gender more appropriate for the CA in stereotypically male situations and the female gender more appropriate in stereotypically female situations, regardless of the customer’s gender.

H2b: Participants would deem warmth features more appropriate for the CA in stereotypically female situations and competence features more appropriate in stereotypically male situations, regardless of the customer gender.

H2c: We expected the effect of the stereotypically male and female situations to be moderated by sexist attitudes such that the more sexist (hostile or benevolent) the participant is, the more they would rely on gender stereotypes in their evaluation of CAs, regardless of the customer gender.