Human–AI collaboration: trade-offs between performance and preferences

Lukas W Mayer; Sheer Karny; Jackie Ayoub; Miao Song; Danyang Tian; Ehsan Moradi-Pari; Mark Steyvers

doi:10.1186/s41235-026-00713-1

2026 Feb 27;11:18.

PMCID: PMC12949212 PMID: 41758389

Abstract

Despite the growing interest in collaborative AI, designing systems that seamlessly integrate human input remains a major challenge. In this study, we developed a task to systematically examine human preferences for collaborative agents. We created and evaluated five collaborative AI agents with strategies that differ in the manner and degree they adapt to human actions. Participants interacted with a subset of these agents, evaluated their perceived traits, and selected their preferred agent. We used a Bayesian model to understand how agents’ strategies influence the human–AI team performance, AI’s perceived traits, and the factors shaping human preferences in pairwise agent comparisons. Our results show that agents who are more considerate of human actions are preferred over purely performance-maximizing agents. Moreover, we show that such human-centric design can improve the likability of AI collaborators without reducing performance. We find evidence for inequality-aversion effects being a driver of human choices, suggesting that people prefer collaborative agents which allow them to meaningfully contribute to the team. Taken together, these findings demonstrate how collaboration with AI can benefit from development efforts, which include both subjective and objective metrics.

Keywords: Human–AI collaboration, Dynamic decision-making, Human-centered algorithm design, Bayesian modeling of preferences

Significance

Human-AI collaboration is expected to grow in the coming years. Particular attention is being paid to agentic cooperative AI that is capable of autonomously performing helpful tasks without repeated human instruction due to its potential to significantly improve the performance of human-AI teams. However, the use of cooperative AI agents poses two key challenges: (1) the development of such agents in modern multiagent reinforcement learning paradigms often excludes human collaborators, and (2) the process of integrating human preferences into the algorithms underlying AI agents remains poorly understood. Our study addresses these shortcomings by establishing an empirical framework to evaluate how algorithmic changes can be mapped to human preferences. Our study reveals key dynamics, such as algorithm changes that increase human liking of the AI agent without harming the performance of the human-AI team, and a pronounced human preference for inequity-aversion. These findings inform human-AI development by demonstrating how collaborative AI can be both effective and enjoyable. Our approach adjusts agent behavior by modifying algorithmic inputs and outputs, making it broadly applicable to new and existing agentic systems.

Introduction

Contemporary AI technologies have matured to the point where their integration into everyday activities has become feasible. This integration is taking place in a wide range of fields including healthcare, education, and gaming (Maslej et al., 2024). One setting for AI integration that is becoming increasingly common involves a user prompting an AI, for example a chatbot. Here, the user explicitly instructs the AI to produce an output or the AI simply offers non-binding suggestions to the user (Bansal et al., 2019; Vodrahalli et al., 2022). In these types of interactions, human–AI collaboration happens sequentially: A user prompt is followed by an AI response with the human always remaining the ultimate decision-maker. One alternative setting that has gained interest recently is collaboration with an agentic AI, where an AI agent can take actions independently from the human (Bennett et al., 2025; Carroll et al., 2019; Crandall et al., 2018; McKee et al., 2024; Nalepka et al., 2019; Puig et al., 2024; Strouse et al., 2021). An AI being able to act independently from the human could purportedly yield productivity gains due to the ability to concurrently distribute labor across both the human and this agentic AI. For example, imagine that a human and an AI are collaborating on a software project with multiple outstanding tasks. Instead of merely advising the human on how to tackle each task, as in the chatbot setting, the agent might independently address some of the more routine tasks, thus leaving more time for the human to focus on the more challenging tasks.

Current machine learning research typically views agentic AI designed to collaborate with people as a multi-agent reinforcement learning (MARL) problem (Gronauer & Diepold, 2022). A common difficulty in MARL research is integrating human considerations into the algorithms that give rise to the agentic AI’s behavior. For example, researchers have modeled agents from human demonstrations (behavior cloning), but this method has several notable limitations. First, human data are relatively costly to collect. Second, training an algorithm to reproduce human-like behavior does not explicitly integrate validated design principles (Codevilla et al., 2019). Finally, behavior cloning struggles to perform as well as more simulation-based methods (Strouse et al., 2021). This combination of factors limits the applicability of behavior-cloning approaches.

Due to the complexity of integrating humans in the modeling, the development of human-in-the-loop multi-agent systems often neglects recent behavioral studies, which have shown that effective collaborative AI should take into account subjective factors beyond objective performance measures (Crandall et al., 2018; Ho et al., 2016; McKee et al., 2024; Puig et al., 2021; Siu et al., 2021; Tang et al., 2022; Zhang, 2023; Zhang et al., 2021). For example, people prefer an AI agent whose behavior is predictable and transparent, as these characteristics make the AI’s actions more understandable and reliable (Crandall et al., 2018; Ho et al., 2016; Tang et al., 2022; Zhang, 2023; Zhang et al., 2021). Similarly, people prefer non-adaptive, rule-based agents over learning-based agents due to their predictability and ease of interaction (Siu et al., 2021). The contradiction is evident when we compare human preferences like predictability and simplicity to a contemporary MARL algorithm, which involves complex and adaptive behaviors that are inherently opaque in their decision-making (Gronauer & Diepold, 2022). Finally, there is evidence to suggest that people expect collaborative AI agents to exhibit certain behavioral characteristics, or “traits”. These traits include behaviors that elicit perceptions of warmth, competence, intentionality, and engagement (McKee et al., 2024; Siu et al., 2021), concepts that receive little attention in algorithm development. Given this divide between the technical development of agentic AI systems and people’s expectations for a collaborative agent, there is a growing need for paradigms that can shape algorithm development in a manner that is compatible with human preferences.

Addressing human preferences when developing collaborative AI agents is of significant concern since the adoption of AI agents will critically depend on human users’ acceptance (Steyvers & Kumar, 2024), yet research concerned with the development of collaborative AI is generally focused on either multi-agent performance metrics or human-centered designs, rarely both (Bhambri et al., 2023; McKee et al., 2024). In this study, we aim to show an avenue for building effective, human-aligned, collaborative AI by combining performance-driven AI designs and human-centered design approaches. Specifically, we evaluate the impact of distinct AI collaboration strategies on performance and human users’ perceptions. To do this, we developed a set of rule-based agents, each representing variations of an egocentric, performance-maximizing agent that incorporates additional manipulations meant to reflect one or more behavioral traits. In addition, we conducted two behavioral experiments that systematically evaluate how these algorithmic variations impact both the performance of the human–AI team and the perceived traits of the AI from the human’s perspective.

Our study addresses two central questions. First, which factors most influence human preferences for collaborative AI agents? In our experiments, we examined both objective metrics related to performance and subjective factors related to people’s perception of the AI agents’ various behavioral traits. Second, do the trade-offs between AI performance and human preference always operate in a strictly linear, zero-sum manner, where improving one inherently detracts from the other? Alternatively, are there strategies or design choices that can achieve a net positive effect, where a marginal decrease in AI performance (if any) is more than compensated by significant improvements in human performance and approval?

Behavioral experiments

To address these questions, we conducted two behavioral experiments to systematically evaluate various AI agent designs, examining how different collaborative strategies affect both performance and human preferences. In our study, human participants interacted with several variations of a collaborative AI agent within a dynamic decision-making task. By analyzing participants’ experiences and preferences when collaborating with different types of agents, we seek to identify the key factors that contribute to effective AI collaboration. Importantly, we provide a paradigm in which different algorithmic manipulations can be mapped onto subjective perceptions, allowing us to evaluate whether our manipulations approximate commonly reported design principles. Thus, our experiments provide an approach for bridging the gap between performance-oriented algorithm design and user-centered preferences in human–AI collaboration.

Collaborative target interception task

To investigate human preferences for collaborative AI agents, we created a task in which a human player and an AI agent work together toward a common goal: achieving a maximally high team score. Our task is an extension of a dynamic decision-making task previously used to study how humans adopt AI assistance (Karny et al., 2024). The objective of the game is to collect as many points as possible by intercepting point-valued targets which move at constant speeds through the game environment (see Fig. 1). The task has some of the planning requirements of traveling salesman problems that have been studied in the context of human problem solving (Graham et al., 2000), although it additionally involves moving destinations. Importantly, the task necessitates collaborative planning between the human player and the AI agent, as targets vanish after being intercepted by a player or exiting the game view. New targets appear at random intervals, meaning inconsiderate collaborators can ultimately get in each other’s way. Each player clicks on targets to direct their avatar to the best interception point. This means that the interface handles the navigation while players focus on the decision-making.

Fig. 1 — Illustration of the collaborative target interception task with a human player and an AI agent. The game is played in a circular environment where the participant (red avatar) and the AI agent (green avatar) have to collect points by intercepting moving targets (circles) that appear in the game area. New targets appear in the game area, move along a straight path, and then disappear again once they reach the game’s edge. Players can click on a target to direct their avatar to the optimal interception point. Arrows are used to illustrate the path and speed of motion of targets and players, but they do not appear in the game environment. The cross-hairs on the targets indicate which target each agent is pursuing. Targets have different point values, as indicated by the orange fill. The game displays score metrics for both individual players and the team (right). Participants interact with various AI agents represented by color names

A key aspect of the task is the target density, which dictates the number of targets that can be present simultaneously in the game environment. By changing the target density, players face different collaborative demands. When many targets are available, both the human player and the AI agent have many targets to choose from, rarely resulting in redundant pursuits. However, when there are few targets, the human player and the AI agent must avoid following the same target to maximize team performance. Effective delegation also ensures that each player does not miss opportunities to intercept other valuable targets before they become unavailable.

Collaborative agents

Participants were assigned two out of the five agents we developed. They played one round of our experimental task with each of these two agents before evaluating these AI collaborators in a Likert questionnaire, indicating their preferred agent, and writing open-ended statements that justified their choices. This procedure was repeated once, so that participants experienced the agents in both target density settings (see Fig. 3 for an overview of the experimental procedure).

Fig. 3 — Illustration of the procedure in Experiment 1. Top and bottom rows (1 and 2) illustrate the two blocks in the experiment. Within each block, participants play two rounds, one with each AI agent from their assigned pair. Agents are then evaluated on a variety of dimensions using 7-point Likert scales. After submitting their ratings, participants indicate their preferred agent in a two-alternative forced choice. Finally, participants are asked to provide free-text responses explaining why they chose the agent they preferred. This procedure is repeated over two blocks where target density is varied. In the illustration, the first and second blocks have low and high target densities, respectively. The density order is counterbalanced in the experiment

Each of the five AI agents is a variant of a planning algorithm with additional constraints that modify their behavior. These additional constraints reflect different rules that could be thought of as promoting collaborative behaviors. Examples include the avoidance of interfering with ongoing target interceptions, seeking spatial separation to minimize overlap with human actions, mimicking human decision-making capabilities, and focusing on targets the participant is otherwise unlikely to pursue. For details on the collaborative strategies, see the Methods section.

Experiment 1

Methods

Participants

Three hundred participants were recruited from the online participant recruitment platform Prolific (Prolific, 2014). In total, 287 of these 300 participants were included in the analyses, with the remaining 13 excluded for having incomplete responses. Ages ranged from 18 to 84 (mean = 35.3, SD = 12.4), with 53% identifying as female, 46% as male, and 1% abstaining from gender identification. All participants were residents of the USA and had not taken part in any of our previous experiments. The study was conducted on participants’ personal computers, and each participant was compensated with 5 USD for their participation in the 25-minute experiment. The average compensation rate was 13 USD per hour.

Informed consent was obtained from each participant before the study commenced. The study protocols were approved by the Institutional Review Board of the University of California, Irvine (IRB #4527), and the study was conducted in accordance with the principles of the Declaration of Helsinki. Participants were assured of the confidentiality of their responses and informed of their right to withdraw from the study at any time without penalty.

Game environment

There are two agents in the game: a collaborative AI player and a human player, each of whom has their own unique icon and distinctively colored square. The objective of the game is to intercept as many moving, point-valued targets as possible within a fixed time frame. Both the human player and the AI player can intercept targets, but each target can be intercepted only once. Optimal task performance necessitates quick strategic decisions from the human player to intercept targets in a particular sequence during the limited time they are available while also paying attention to and coordinating with the actions of the collaborative AI player.

Targets spawn randomly at the edge of the circular game area. Their initial movement angle is randomly set within a cone of possible angles. Spawned targets move in straight-line directions at constant speed, sampled from a uniform distribution that ensures target speeds are between 1% and 50% slower than the player’s avatar. Targets exit the playable area if they are not intercepted. The spawning process ensures that the number of objects in the game area is limited to the target density (either 5 or 15 targets). One key feature of the spawning process is that it is independent of the player’s skill in interception. After a player intercepts a target, it disappears from the game area, but its path is still computed until it hits the perimeter. Only once a target, visible or not, hits this perimeter will a new target be spawned. Each target is worth between zero and fifteen points, with the probability distribution of point values following a Beta(1,2) distribution that we discretized over point values via binning. In practice, this means that low-value targets appear more often than high-value targets.
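As a minimal sketch of the sampling described above, the following Python snippet draws a point value and a speed for a new target. The equal-width binning, the unit player speed, and all function and constant names are assumptions made for illustration; the text only states that a Beta(1,2) distribution is discretized via binning and that target speeds are uniformly 1% to 50% slower than the player's avatar.

```python
import random

PLAYER_SPEED = 1.0   # hypothetical unit speed; no value is given in the text
MAX_POINTS = 15      # targets are worth between 0 and 15 points


def sample_point_value(rng: random.Random) -> int:
    """Draw from Beta(1, 2) and bin the draw into an integer in 0..MAX_POINTS.

    Equal-width bins are an assumption; the text only says "via binning".
    """
    x = rng.betavariate(1.0, 2.0)  # density 2 * (1 - x) on [0, 1]
    return min(int(x * (MAX_POINTS + 1)), MAX_POINTS)


def sample_target_speed(rng: random.Random) -> float:
    """Uniform speed that is 1% to 50% slower than the player's avatar."""
    slowdown = rng.uniform(0.01, 0.50)
    return PLAYER_SPEED * (1.0 - slowdown)


rng = random.Random(0)
values = [sample_point_value(rng) for _ in range(10_000)]
low = sum(v <= 7 for v in values)    # count of lower-half point values
high = sum(v >= 8 for v in values)   # count of upper-half point values
```

Under these assumptions, roughly three quarters of the sampled targets fall in the lower half of the point range, which matches the statement that low-value targets appear more often than high-value ones.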

Players click on targets in order to intercept them. A target click initiates an interception algorithm to calculate the optimal interception path for the player’s avatar. At any point in time, the player can click on different targets to change the path of interception, meaning current trajectories can be interrupted. There is also the option of clicking on a shaded point in the center that allows players to traverse back to the center of the play area. It is not guaranteed that a player will intercept the target once clicked. The interception point can lie outside the playable game area if the target is too far away for the player to intercept in time. As a result, the player’s avatar is guided to the edge of the playable area. The player’s avatar will not re-navigate automatically and thus will remain at the edge of the map until the player chooses a new navigation target. Note that players can intercept targets that are not explicitly chosen for interception. That is, if a target lies on the path of interception to the chosen target, it will also be intercepted, and its point value will be added to the total. A colored cross-hair made of four triangles highlights the target currently being pursued. The target marker’s color is congruent with the agent’s identifying color and indicates the current target each player is pursuing. Both agents can have active cross-hairs on the same target without visual overlap as the AI player’s cross-hairs are rotated by 45 degrees.
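The interception algorithm itself is not specified in this section. One standard way to compute an interception point for a target moving at constant velocity and a pursuer moving at constant speed is to solve a quadratic in the interception time. The sketch below illustrates that general technique; the function name and argument layout are hypothetical, and it should not be read as the implementation used in the experiment.

```python
import math


def interception_point(pursuer, speed, target, velocity):
    """Earliest point at which a pursuer moving at `speed` can meet the target.

    Solves |target + velocity * t - pursuer| = speed * t for the smallest
    t >= 0. Returns (t, (x, y)), or None if no such time exists.
    """
    dx, dy = target[0] - pursuer[0], target[1] - pursuer[1]
    vx, vy = velocity
    a = vx * vx + vy * vy - speed * speed   # negative when the target is slower
    b = 2.0 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy
    if abs(a) < 1e-12:
        # Equal speeds: the quadratic degenerates to the linear b * t + c = 0.
        if abs(b) < 1e-12:
            return None
        t = -c / b
        if t < 0.0:
            return None
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        roots = ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))
        candidates = [r for r in roots if r >= 0.0]
        if not candidates:
            return None
        t = min(candidates)
    return t, (target[0] + vx * t, target[1] + vy * t)
```

Because targets are always slower than the avatar, a non-negative solution always exists in this setting. The returned point can still fall outside the circular game area, which corresponds to the case in which the avatar is guided to the edge without intercepting the target.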

The game’s user interface also includes a display indicating the team score, player score, and AI score. Adjacent to the game area is information to support the player in keeping track of the AI player identities. Here, the icon of the collaborative AI is displayed along with a message that identifies its appearance in the game, e.g., “Howdy! I’m Green-Bot. I’ll be controlling the green square.” See Fig. 1 for an illustration of the interface.

Collaborative AI agents

We designed five different AI agents to collaborate with human players in the target interception task. Each of the five agents is a variant of a basic search algorithm capable of planning target interception sequences with up to three targets (Karny et al., 2024). These modifications aimed to improve the interaction between the AI and human players by addressing specific challenges to collaboration in our task.

Search Algorithm The search algorithm is designed to approximate optimal solutions to the target interception task. The algorithm computes all possible interception sequences involving up to three targets, updating the positions of both the AI player and the targets throughout the sequence. This ensures that the AI player can respond to dynamic changes in the game state. The three-target limit is imposed to ensure the algorithm can operate in real time during behavioral experiments. For more details, see Appendix B.
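To make the enumeration concrete, the sketch below scores every ordered sequence of up to three targets while advancing target positions along the sequence. The objective used here (points collected per unit time), the arena centered at the origin, and all names are assumptions for illustration only; the actual objective and details are given in Appendix B.

```python
import itertools
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    x: float
    y: float
    vx: float
    vy: float
    value: int


def intercept_time(px, py, speed, tx, ty, vx, vy):
    """Smallest t >= 0 with |target(t) - pursuer| = speed * t.

    Assumes the target is strictly slower than the pursuer.
    """
    dx, dy = tx - px, ty - py
    a = vx * vx + vy * vy - speed * speed   # negative for slower targets
    b = 2.0 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy
    disc = b * b - 4.0 * a * c
    return (-b - math.sqrt(disc)) / (2.0 * a)


def best_plan(px, py, speed, targets, radius, max_len=3):
    """Exhaustively score ordered sequences of up to `max_len` targets."""
    best_seq, best_rate = (), 0.0
    for k in range(1, max_len + 1):
        for seq in itertools.permutations(targets, k):
            x, y, elapsed, points, feasible = px, py, 0.0, 0, True
            for tg in seq:
                # Move the target to where it is when this leg starts.
                tx, ty = tg.x + tg.vx * elapsed, tg.y + tg.vy * elapsed
                t = intercept_time(x, y, speed, tx, ty, tg.vx, tg.vy)
                x, y = tx + tg.vx * t, ty + tg.vy * t
                if math.hypot(x, y) > radius:
                    feasible = False   # target leaves the arena first
                    break
                elapsed += t
                points += tg.value
            # Assumed objective: points collected per unit of travel time.
            if feasible and elapsed > 0.0 and points / elapsed > best_rate:
                best_seq, best_rate = seq, points / elapsed
    return best_seq, best_rate
```

With n targets on screen, the loop evaluates n + n(n-1) + n(n-1)(n-2) sequences, which is 2,955 for n = 15. This illustrates why capping the plan length keeps the search tractable in real time.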

Agent Variations The search algorithm formed the basis for all agents in our experiment. We developed several variations of the search algorithm to create different collaborative AI players. These variations included changes to the target consideration set (which targets the AI player could pursue), delays in initiating a plan, and perception of point values. Variations were conceived as mechanisms that incorporate and give rise to heightened perceptions of traits observed in the previous research (McKee et al., 2024; Siu et al., 2021). For a graphical overview of agent types, see Fig. 2. This study is not designed to test all possible combinations of these features. Instead, we focused on a set of five agents that test out a key set of variations:

Ignorant Agent Our baseline agent uses a basic search algorithm to plan an optimal interception sequence over three objects currently in the game environment. The agent is not provided with information about human intent: it is ignorant about which target the human has clicked and is in the process of intercepting. Therefore, the ignorant agent can pursue the same target as the human. Overall, the agent is egocentric in that it does not change its behavior in response to the human player’s actions. This agent serves as a baseline comparison for the other agents, as this agent is the least considerate of the human’s actions.
Omit Agent This agent operates with the same search algorithm as the Ignorant agent. However, it is provided with information about the human’s intended target and can reason about the set of other targets that the human will intercept on the way to the intended target. The agent omits this set of targets from its consideration set, meaning these targets cannot be part of the agent’s interception plan. If the human player clicks on a new target, the consideration set is recalculated, so that targets previously clicked by the human are included again, but the target currently marked by the human is not. Dynamic updating also applies to the targets that will be intercepted by the human if they complete their current path. The next three agents are all variations of the omit agent.
Divide Agent This agent operates in the same fashion as the omit agent but applies a divide-and-conquer strategy. This was done to make it easier for the human player and AI agent to avoid getting in each other’s way and, potentially, have better task delegation (Bennett et al., 2025; Wu et al., 2021). The agent (virtually) divides the game area into two halves where the dividing line is orthogonal to the imaginary line from the human player to the game’s center-point. The agent only considers targets that can be intercepted in the half not occupied by the human player, with the allotted area being continuously recomputed as the human’s position changes. Therefore, this agent omits a larger set of targets from consideration than the omit agent, leaving more targets for the human player, further reducing potential interception conflicts between the players.
Delay Agent This agent operates in the same fashion as the omit agent but adds a delay between the time a target is intercepted and the selection of a new target to pursue. This artificial delay is designed to decrease the difference in performance relative to the human, as the AI no longer reaches superior performance merely by executing actions more quickly than the human player. This delay is set to adaptively approximate the human player’s response times (RTs) throughout the experiment with an exponential moving average of the previous five response times, where a response time constitutes the stretch of time between the point at which an ongoing action is completed and the point at which a new action is initiated.
Bottom-Feeder Agent This agent is based on the omit agent but makes changes to the objective function by inverting the target values. Like an ecological bottom-feeder that consumes lower-value resources, this agent will consistently target the lowest value targets available to it. This reduces competition between the human and AI for valuable resources, allowing the human to focus on intercepting the most rewarding targets. While this strategy appears irrational at face value, it may serve collaboration by ensuring that human and AI actions complement each other.
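The consideration-set manipulations above can be sketched as simple filters applied before the search runs. In the sketch below, the dividing line is assumed to pass through the center of the game area (taken as the origin), and value inversion is assumed to be `MAX_POINTS - value`; both choices, along with all names, are illustrative assumptions and not details confirmed by the text.

```python
from dataclasses import dataclass

MAX_POINTS = 15  # highest point value a target can have


@dataclass(frozen=True)
class Candidate:
    name: str
    intercept_xy: tuple  # where the AI would intercept this target
    value: int


def omit_filter(candidates, human_claimed):
    """Omit: drop targets the human intends or is predicted to intercept."""
    return [c for c in candidates if c.name not in human_claimed]


def divide_filter(candidates, human_xy):
    """Divide: keep targets interceptable in the half the human is not in.

    The dividing line is orthogonal to the human-to-center line, so a point
    lies in the human's half when its dot product with the human's position
    (relative to the center) is positive.
    """
    hx, hy = human_xy
    return [
        c for c in candidates
        if c.intercept_xy[0] * hx + c.intercept_xy[1] * hy <= 0.0
    ]


def bottom_feeder_value(value):
    """Bottom-Feeder: invert values so the lowest-value target scores best."""
    return MAX_POINTS - value
```

For example, with the human at (2, 0), a target interceptable at (5, 0) is excluded by `divide_filter`, whereas one interceptable at (-5, 0) is kept. When the human sits exactly at the center, this sketch keeps every target, which is one of several possible tie-breaking choices.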

Fig. 2 — Overview of AI collaborator behavior differences. a The Ignorant agent always pursues the highest value target, no matter what the human does. b The Omit agent “omits” targets that the human is intended or predicted to intercept from consideration and is equivalent to Ignorant otherwise. c The Divide agent extends the logic of Omit by also only considering targets on its half of the game environment. d The Delay agent approximates the reaction time the human is demonstrating and is otherwise equivalent to Omit. e The Bottom-Feeder “inverts” the value function of Omit, so it always pursues the lowest value target. Note that grayed out targets are not visible to the search algorithm
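The Delay agent's adaptive delay can be illustrated with a small estimator that keeps the five most recent human response times and returns their exponential moving average. The smoothing factor 2 / (window + 1) and the initial delay used before any response time has been observed are assumptions; the text does not report these values, and the class name is hypothetical.

```python
from collections import deque


class DelayEstimator:
    """Tracks recent human response times (RTs) and suggests an AI delay."""

    def __init__(self, window=5, alpha=None, initial_delay=0.5):
        self.rts = deque(maxlen=window)  # only the most recent RTs are kept
        # Assumed smoothing factor; a common convention for a given window.
        self.alpha = alpha if alpha is not None else 2.0 / (window + 1)
        self.initial_delay = initial_delay  # assumed fallback, in seconds

    def observe(self, rt):
        """Record one RT: time from finishing an action to starting the next."""
        self.rts.append(rt)

    def delay(self):
        """Exponential moving average over the stored RTs, oldest first."""
        if not self.rts:
            return self.initial_delay
        it = iter(self.rts)
        ema = next(it)
        for rt in it:
            ema = self.alpha * rt + (1.0 - self.alpha) * ema
        return ema
```

Recent response times receive more weight than older ones, so the delay tracks changes in the human's pace over the course of a round.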

Procedure

Participants accessed the study via Prolific and began by completing a consent form. They then went through an interactive tutorial that explained the game’s mechanics. Before commencing with the main experiment, participants were required to demonstrate an understanding of these game mechanics. Participants were also informed that the premise of this study is to evaluate how people play alongside a collaborative AI robot in quickly changing environments.

In the main experiment, participants played two blocks, each with two 3-minute rounds (see Fig. 3). Each block featured a different target density (5 or 15). Participants played one round with each of the two collaborative agents per block.

Even though participants played with the same pair of agents in the first and second blocks (at different target densities), this information was not made explicit to participants. In fact, participants would have had reasons to believe that the bots they experienced in the first block are different from those in the second block. For example, participants were informed that they were playing with the green and purple bots in rounds 1 and 2, and the copper and blue bots in rounds 3 and 4. Each bot had two color variations, disguising the fact that participants interacted with only two agents. This identity distinction was compounded by the change in target density, making it relatively hard to compare the behavior of the agents in the first block with those featured in the second block. Obscuring the identity of the AI collaborator through these measures served to ensure that participant judgments were made independently from previous rounds.

At the end of each block, participants rated each agent on eight dimensions of collaborative ability and performance. Table 2 shows the list of survey questions, which were based in part on prior research (Attig et al., 2024; Siu et al., 2021). The questionnaire was presented in a matrix format, with one matrix for each agent. Each row of the matrix contained a question item with Likert-scale values as the columns. The two matrices were placed next to each other so that the left-hand matrix pertained to one agent with the right-hand matrix corresponding to the other agent. After rating the pair of agents, participants were presented with a choice screen where they selected the agent they preferred to play with. Upon making their selection, participants were prompted to provide an open-ended response explaining their choice. The interface enforced a minimum character limit, requiring participants to write at least a few words. At the conclusion of the study, participants were given the opportunity to provide general open-ended feedback.

Table 2.

Survey questions

Q1	“The bot and I were a team.”
Q2	“The bot was competent.”
Q3	“I understood the bot’s intentions.”
Q4	“The bot understood my intentions.”
Q5	“I contributed more to the team’s performance.”
Q6	“The bot was easy to play with.”
Q7	“The bot was fun to play with.”
Q8	“The bot and I had a similar playing style.”

Design

The study followed a mixed within- and between-subjects design. The target density was a within-subjects variable, since each participant performed the task with 5 and 15 maximum concurrent targets. The between-subjects variable included the assignment of AI agents to participants. Each participant was assigned two of the five collaborative AI agents. Each participant played a round with each of the two agents for each target density level. The ordering of agents across rounds and the ordering of target densities across blocks were counterbalanced.
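The design can be sketched by enumerating the possible cells. The snippet below assumes that all ten unordered agent pairs were used, that agent order and density order were fully crossed, and that participants were assigned to cells in rotation. None of these details is stated in the text, so the cell structure and the `assign` helper are purely illustrative.

```python
import itertools

AGENTS = ["Ignorant", "Omit", "Divide", "Delay", "Bottom-Feeder"]
DENSITIES = [5, 15]  # maximum number of concurrent targets


def design_cells():
    """Enumerate agent pair x agent order x density order (assumed crossing)."""
    cells = []
    for pair in itertools.combinations(AGENTS, 2):
        for agent_order in itertools.permutations(pair):
            for density_order in itertools.permutations(DENSITIES):
                cells.append(
                    {"agent_order": agent_order, "density_order": density_order}
                )
    return cells


def assign(participant_index, cells):
    """Hypothetical round-robin assignment of participants to cells."""
    return cells[participant_index % len(cells)]


cells = design_cells()
```

Under these assumptions there are 10 pairs, 2 agent orders, and 2 density orders, giving 40 cells, with each agent appearing in 16 of them.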

Data analysis

To assess statistical significance, we utilized Bayes factors (BFs) to determine the extent to which the observed data adjust the a priori belief in the alternative and null hypotheses. Values of 3 < BF < 10 and BF > 10 indicate moderate and strong evidence against the null hypothesis, respectively. Similarly, values of 1/10 < BF < 1/3 and BF < 1/10 indicate moderate and strong evidence in favor of the null hypothesis, respectively (Jeffreys, 1961; Rouder et al., 2012, 2009).
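The thresholds above can be written as a small lookup. How to treat values exactly equal to a boundary and the label "inconclusive" for the range between 1/3 and 3 are assumptions, since the text does not specify them; the function name is hypothetical.

```python
def interpret_bf10(bf10: float) -> str:
    """Map a Bayes factor BF10 onto the evidence categories used in the text."""
    if bf10 <= 0.0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 > 10.0:
        return "strong evidence against the null"
    if bf10 > 3.0:
        return "moderate evidence against the null"
    if bf10 < 1.0 / 10.0:
        return "strong evidence for the null"
    if bf10 < 1.0 / 3.0:
        return "moderate evidence for the null"
    return "inconclusive"  # assumed label for 1/3 <= BF10 <= 3
```

For instance, a value above 100 is classified as strong evidence against the null, whereas a value between 0.4 and 1 falls in the inconclusive range.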

The analysis of performance and the questionnaire scores was performed using Bayesian ANOVAs and follow-up t-tests. All statistical results related to Bayes factors were computed with the BayesFactor package (Version 0.9.12-4.7) in the R statistical computing software (Morey, 2024). Since we performed sensitivity analyses for our Bayesian inferential statistics, the main paper only reports key, prior-robust results in the interest of brevity. The full set of results with sensitivity analyses and code are openly accessible at this project’s OSF page.

Estimation of the logistic regression in Eq. 2 was performed with Bayesian methods in the JASP environment (Version 0.19; JASP Team, 2024) using the default priors based on the generalized g-prior distribution (CCH; Li & Clyde, 2018) with $\alpha = 0.5$, $\beta = 2$, and $s = 0$. In addition to Bayes factors for each individual covariate, we also report the 95% credible interval (CI). Although it might be tempting to use the CI to test hypotheses (e.g., rejecting the null hypothesis if the CI does not include the null value), in accordance with recent recommendations (van den Bergh et al., 2021; Wagenmakers et al., 2020), we use a more conservative approach, where the CI becomes relevant only after the BF shows evidence for the alternative hypothesis.

Results

We begin our reporting of the results by showing outcomes from objective performance metrics such as the performance of human and collaborative AI agents. Additionally, we highlight behavioral measures related to the degree to which one agent interferes with the plan of the other agent. We then report the results from the subjective metrics based on the questionnaire responses. Finally, we examine human preferences for various types of collaborative agents and apply predictive models to determine which objective and subjective metrics best predict choice.

Objective metrics

Performance Differences Figure 4 shows the performances of the human player and the collaborative AI agent across different human–AI teams. The results show significant differences in individual human player and AI agent performance. The best-performing AI agent for both the low and high target density conditions was the Ignorant agent. At the same time, human performance was worst with the Ignorant agent. This shows that the Ignorant agent that ignored all human intentions and acted as a single player was effective in maximizing its own performance but had a negative impact on the performance of its human partner.

Fig. 4 — Performance of the human and AI player by AI agent type (columns) and target density (rows) in Experiment 1. Performance is assessed by a relative score: the total points scored relative to the total points that were available during game play. Gray areas visualize the distribution of proportional scores; error bars show the standard error of the mean

In low target density (a maximum of 5 concurrent targets), participants achieved the highest and next highest performance when playing with the Delay and Divide agent, respectively. In fact, participants performed better with any of the experimental agents, relative to the Ignorant agent baseline, $BF_{10} > 100$. This shows that the agents that aimed to reduce conflict and performance differences best amplified human performance. In high target density (a maximum of 15 targets), human players performed best with the Bottom-Feeder agent. However, performance differences in the high target density condition were less pronounced, suggesting that the increased availability of targets to intercept led to human strategies that were less affected by the AI agent.

Figure 5 shows the performance of the human–AI team where the score is combined across the human player and the AI agent. In the high target density condition, human+Omit teams slightly outperformed the human+Ignorant teams, although the difference was not statistically significant, $0.4 \leq BF_{10} < 1$. Thus, the additional design features of the Omit agent presumably caused human performance gains that were at least as large as the AI performance decreases. In the low target density condition, the best team performance was achieved with the Ignorant agent, closely followed by the Omit agent.

Fig. 5 — Mean team score by AI agent type and target density in Experiment 1. The team score is based on the sum of score of the AI agent and the human playing with that AI agent relative to the total value of points that was available during game play. Gray shading indicates the distribution of values, while error bars show the standard error from the mean
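The relative scores described in the captions of Figs. 4 and 5 can be sketched as follows. This is a minimal illustration based only on the caption wording; the function names and the example point values are hypothetical, and the authors' actual scoring code may differ in detail.

```python
def relative_score(points_scored: float, points_available: float) -> float:
    """Share of the available points that one player collected."""
    if points_available <= 0:
        raise ValueError("points_available must be positive")
    return points_scored / points_available


def team_score(human_points: float, ai_points: float, points_available: float) -> float:
    """Combined human + AI points as a share of the available points."""
    return relative_score(human_points + ai_points, points_available)


# Hypothetical round: 200 points available, human collects 50, AI collects 90.
human = relative_score(50, 200)
ai = relative_score(90, 200)
team = team_score(50, 90, 200)
```

Because both players draw on the same pool of available points, the team score under this reading is simply the sum of the two individual relative scores.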

Other Behavioral Metrics. We also examined several behavioral metrics that distinguish between the AI agents, focusing on measures related to conflict between the human player and the AI agent. One such metric, the number of AI “steals”, is defined as the number of times the AI agent intercepts a target initially pursued by the human player. Appendix Fig. 13 presents a visualization of these results. As expected, the Ignorant agent shows the highest number of interceptions of targets that the human intended to catch, since this agent disregards the human’s intentions and will pursue targets regardless of the human’s planned actions. Additionally, we analyzed the number of path intersections between the human player and the AI agent as another indicator of potential conflict. Path intersections were operationalized as overlap in the two agents’ movement trajectories: each agent has an avatar location in the game world and a location it is moving toward, so a path intersection occurs when the agents’ concurrent trajectories cross.
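One way to operationalize a path intersection is a standard segment-intersection test, treating each agent's trajectory as the straight line from its current location to the location it is moving toward. The sketch below is an assumption about how such a check could look, not the authors' implementation; the function names and the straight-line simplification are illustrative.

```python
from typing import Tuple

Point = Tuple[float, float]


def _orientation(p: Point, q: Point, r: Point) -> float:
    """Cross product of pq and pr: >0 counter-clockwise, <0 clockwise, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def _within_box(p: Point, q: Point, r: Point) -> bool:
    """True if r lies inside the bounding box of segment pq (for collinear cases)."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def paths_intersect(a_pos: Point, a_goal: Point, b_pos: Point, b_goal: Point) -> bool:
    """True if segment a_pos->a_goal crosses or touches segment b_pos->b_goal."""
    d1 = _orientation(b_pos, b_goal, a_pos)
    d2 = _orientation(b_pos, b_goal, a_goal)
    d3 = _orientation(a_pos, a_goal, b_pos)
    d4 = _orientation(a_pos, a_goal, b_goal)

    # Proper crossing: each segment's endpoints lie on opposite sides of the other.
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True

    # Degenerate cases: an endpoint lies exactly on the other segment.
    if d1 == 0 and _within_box(b_pos, b_goal, a_pos):
        return True
    if d2 == 0 and _within_box(b_pos, b_goal, a_goal):
        return True
    if d3 == 0 and _within_box(a_pos, a_goal, b_pos):
        return True
    if d4 == 0 and _within_box(a_pos, a_goal, b_goal):
        return True
    return False


# Hypothetical positions: the two planned paths form an X and therefore cross.
crossing = paths_intersect((0, 0), (2, 2), (0, 2), (2, 0))
# Parallel paths never meet.
parallel = paths_intersect((0, 0), (1, 0), (0, 1), (1, 1))
```

Counting how often this test returns true over the course of a round would give one candidate intersection count per human–AI pairing.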

Fig. 13 — Number of stealing occurrences for each player split by AI agent type and target density. A steal is defined as an instance where one agent marked a target before the other agent with the latter agent intercepting that target

Subjective metrics: questionnaire responses

Figure 6 shows the questionnaire results. The most general finding, which holds for all items except Q3 (“I understood the bot’s intentions”), is that ratings for the Ignorant agent differ significantly from those for all other agents. However, for some items this pattern is more nuanced, as demonstrated by the significant target density interaction effects observed for items Q2, Q5, and Q8, $BF_{10} > 100$. For example, for Q2 (“The bot was competent”), the Ignorant, Divide, and Delay agents were rated equally well in the low-density condition, while in the high-density condition the Bottom-Feeder was rated worse than all other agents, including the Ignorant agent. Furthermore, comparisons in the high-density condition of Q8 (“The bot had a similar playing style to me”) show that the ratings of the Ignorant and Bottom-Feeder agents were roughly equally low, while all other agents received significantly higher ratings. Q3 showed no differences in ratings across agents.

Fig. 6 — Mean questionnaire scores by AI agent type and target density in Experiment 1. Questions were rated on a 7-point Likert scale. Error bars indicate the standard error from the mean

Table 1 shows example responses when participants were asked to explain their choice. Participants voice many of the human-centered design considerations in their open-ended responses. The theme of teaming was highly represented in our participants’ open-ended responses. Participants frequently pointed out that the Ignorant agent was not being a good teammate. Appendix D provides a content analysis that confirms that the majority of open-ended responses focused on teaming.

Table 1.

Examples of explanations provided in the open-ended surveys in Experiment 1

Theme	Response
Teaming	“The [Omit] bot felt less like competition and more like a fellow teammate. When I would choose a target, even if it was originally planning on going to that target, it would get out of the way and let me grab the target. This seemed more aligned with two teammates working together than the other robot.”
	“It felt like we were a team and just trying to collect as many circles as possible where as the [Ignorant] bot felt like it was competing against me and would go change direction based on the highest point circles rather than holding down an area of the platform like me and the [Bottom-Feeder] bot would.”
Likability	“The [Divide] bot made playing the game easy and fun. The [Divide] bot spent most of its time in a quadrant away from where I was playing, allowing me not to feel crowded or pressured.”
Intentionality	“The [Bottom-Feeder] bot seemed to just let me get any of the targets that I wanted and didn’t try to fight for them.”

Participants typically referred to the bots using their color labels; for clarity, these references have been replaced with the corresponding bot names

Preferences for collaborative agents

Figure 7 shows participants’ preferences for specific AI agents when they were paired with other agents. In the side columns, we observe that certain agents were consistently preferred regardless of the agent they were paired with. In the low target density condition, the most popular agent was the Bottom-Feeder, chosen 67% of the time, followed closely by the Omit agent at 65%. Conversely, the Ignorant agent was the least preferred, selected in only 20% of pairings. For the high target density condition, the Divide agent became the most preferred, chosen 68% of the time, with the Omit agent close behind at 62%. Again, the Ignorant agent remained the least favored, chosen only 23% of the time.

Fig. 7 — Choice preferences across pairs of agents for each target density condition in Experiment 1. Each matrix cell indicates the percentage of participants preferring the row-associated agent A over the column-associated agent B. For instance, 82% of participants preferred the Omit agent over the Ignorant agent in the low target density condition. The side column shows the overall preference percentage for each row agent across all pairings. Asterisks denote choice percentages that significantly deviate from 50%, indicated by a Bayes factor greater than 10. Note that results here are averaged across different presentation orders of the agents (e.g., agent A could have been presented first or second in the experiment)
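The cell and side-column percentages described in the Fig. 7 caption could be tabulated from raw pairwise choices along the following lines. This is a hedged sketch: the record layout `(first agent, second agent, chosen agent)`, the function names, and the toy records are assumptions made for illustration, and pooling all comparisons for the side column is one plausible reading of "overall preference".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Choice = Tuple[str, str, str]  # (first agent, second agent, chosen agent)


def preference_matrix(choices: List[Choice]) -> Dict[Tuple[str, str], float]:
    """Percentage of comparisons in which the row agent was chosen over the column agent."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for first, second, chosen in choices:
        if chosen not in (first, second):
            raise ValueError("chosen agent must be one of the two compared agents")
        other = second if chosen == first else first
        wins[(chosen, other)] += 1
        # Pool both presentation orders into the same pair of cells.
        totals[(first, second)] += 1
        totals[(second, first)] += 1
    return {pair: 100.0 * wins.get(pair, 0) / n for pair, n in totals.items()}


def overall_preference(choices: List[Choice]) -> Dict[str, float]:
    """Percentage of all comparisons involving an agent in which it was chosen."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for first, second, chosen in choices:
        appearances[first] += 1
        appearances[second] += 1
        wins[chosen] += 1
    return {agent: 100.0 * wins.get(agent, 0) / n for agent, n in appearances.items()}


# Toy records (not study data): five comparisons, presented in both orders.
records = [
    ("Omit", "Ignorant", "Omit"),
    ("Omit", "Ignorant", "Omit"),
    ("Ignorant", "Omit", "Omit"),
    ("Ignorant", "Omit", "Omit"),
    ("Ignorant", "Omit", "Ignorant"),
]
matrix = preference_matrix(records)
overall = overall_preference(records)
```

In this toy data set the Omit agent is chosen in four of five comparisons, so its cell against the Ignorant agent is 80% and the mirrored cell is 20%.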

Predictive models for human preferences

What factors influence people’s preferences for certain AI agents? To investigate this, we apply Bayesian logistic regression models to predict individual choices that people make in a pairwise comparison of two collaborative AI agents. The predictions are based on both objective metrics (e.g., performance-related metrics) and subjective metrics (e.g., Likert ratings that assess the subjective experience with the AI agents). Our approach is grounded in a Bradley-Terry framework (Böckenholt, 2001; Cattelan, 2012), where the likelihood of selecting Agent X over Agent Y in a pairwise comparison depends on the difference in their respective utility scores:

$$\log\left(\frac{P(\text{Choice} = X)}{P(\text{Choice} = Y)}\right) = U_{x} - U_{y} \tag{1}$$

Here $U_{x}$ and $U_{y}$ represent the utility scores of agents X and Y, respectively. If these utilities are expressed as weighted sums of features, this model can be expressed by logistic regression:

$$\log\left(\frac{P(\text{Choice} = X \mid X, Y)}{P(\text{Choice} = Y \mid X, Y)}\right) = β_{0} + β_{1}(X_{1} - Y_{1}) + β_{2}(X_{2} - Y_{2}) + \dots + β_{n}(X_{n} - Y_{n}) \tag{2}$$

where $X_{i}$ and $Y_{i}$ represent the values of the $i$ -th feature for Agents $X$ and $Y$ , respectively, and $n$ is the total number of features. This approach is centered on modeling each covariate in terms of the difference between corresponding feature values of the two agents. The weights $β$ indicate the relative influence of each feature, while $β_{0}$ represents a bias term, accounting for any baseline preference for the first-presented agent in the pairwise comparison (i.e., assuming agents are presented in the order $X$ followed by $Y$ ).
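To make the difference-based logistic model concrete, the following sketch converts feature values for two agents into a choice probability by inverting the log-odds. The function name, the feature ordering, and the weights in the example are hypothetical; they are not the estimates reported in this paper.

```python
import math
from typing import Sequence


def choice_probability(
    x_features: Sequence[float],
    y_features: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """P(Choice = X) when the log-odds are bias + sum_i weight_i * (X_i - Y_i)."""
    if not (len(x_features) == len(y_features) == len(weights)):
        raise ValueError("feature and weight vectors must have the same length")
    log_odds = bias + sum(
        w * (x - y) for w, x, y in zip(weights, x_features, y_features)
    )
    # Inverse of the logit link.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical features per agent: (score inequality, number of AI steals).
# Hypothetical weights: both negative, so more of either lowers preference.
weights = [-0.5, -0.7]
agent_x = [0.10, 1.0]
agent_y = [0.40, 3.0]

p_x = choice_probability(agent_x, agent_y, weights)
p_y = choice_probability(agent_y, agent_x, weights)
```

With a zero bias term the model is symmetric, so `p_x + p_y` equals 1; a non-zero bias breaks this symmetry in favor of whichever agent was presented first.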

To estimate the weights in the logistic regression model of Eq. 2, we use Bayesian methods. We separately apply the model to objective and subjective metrics, further breaking down the results by target density. For the objective metrics, we included features such as the human and AI scores,³ score inequality (defined as the absolute difference between human and AI scores), the number of AI steals, and the number of path intersections between the human and AI agent. For subjective metrics, we included Likert ratings from each of the 8 survey questions. Figure 8 presents the posterior estimates for the $β$ weights (see Appendix Table 3 for full results). A positive $β$ weight indicates that a larger positive difference in feature values between Agents $X$ and $Y$ increases the likelihood of choosing Agent $X$ .

Fig. 8 — Posterior ( $β$ ) coefficients of Bayesian logistic regression models predicting choice in Experiment 1. The coefficients are shown across objective and subjective metrics (left and right panels) and target densities (indicated by colors). Coefficient estimates can be thought of as weights for the importance of metrics in explaining choices. Error bars represent 95% credible intervals

Table 3.

Posterior summaries of coefficients predicting choice in Experiment 1

Model & Coefficient	$BF_{inclusion}$	Mean	SD	95% CI lower	95% CI upper
Target density = 5, Objective metrics
Bias	1.000	$-1.386 \times 10^{-4}$	0.131	$-0.262$	0.246
Human score	1.213	0.258	0.397	$-0.281$	1.076
AI score	1.433	0.325	0.431	$-0.196$	1.221
Score inequality	7.108	$-0.300$	0.173	$-0.577$	0.000
AI steals	8101.536	$-0.745$	0.190	$-1.105$	$-0.349$
Intersections	0.978	$-0.047$	0.101	$-0.315$	0.083
Target density = 5, Subjective metrics
Bias	1.000	0.140	0.177	$-0.193$	0.463
Team (Q1)	7407.462	1.338	0.359	0.683	2.009
Competence (Q2)	1.372	$-0.165$	0.262	$-0.769$	0.167
Understand bot intent (Q3)	2.949	0.393	0.369	$-0.124$	1.076
Understand human intent (Q4)	5.006	0.500	0.369	$-0.003$	1.159
Human contributed more (Q5)	2.749	$-0.305$	0.285	$-0.855$	0.020
Easy to play with (Q6)	0.917	0.011	0.224	$-0.436$	0.560
Fun to play with (Q7)	1.087	0.123	0.320	$-0.344$	0.922
Similar playing style (Q8)	5291.962	1.264	0.337	0.642	1.901
Target density = 15, Objective metrics
Bias	1.000	0.233	0.128	$-0.007$	0.491
Human score	1.329	0.131	0.174	$-0.069$	0.512
AI score	0.768	0.058	0.143	$-0.172$	0.448
Score inequality	272.313	$-0.468$	0.146	$-0.748$	$-0.199$
AI steals	2.454	$-0.201$	0.186	$-0.580$	0.000
Intersections	0.608	0.004	0.075	$-0.159$	0.221
Target density = 15, Subjective metrics
Bias	1.000	1.074	0.253	0.574	1.549
Team (Q1)	1.422	$-0.026$	0.256	$-0.706$	0.463
Competence (Q2)	1.513	0.065	0.242	$-0.444$	0.627
Understand bot intent (Q3)	5.216	0.424	0.349	$-0.113$	1.094
Understand human intent (Q4)	41.769	0.817	0.384	0.000	1.466
Human contributed more (Q5)	1.595	0.093	0.212	$-0.314$	0.616
Easy to play with (Q6)	290.561	1.008	0.366	0.333	1.753
Fun to play with (Q7)	239769.399	1.720	0.459	0.792	2.580
Similar playing style (Q8)	28.285	0.689	0.334	0.000	1.262

For the objective metrics, the results show no noteworthy effect of human or AI scores ( $BF < 3$ ) at either target density: human preferences are not driven by the AI agent’s performance or by participants’ own performance. However, one key predictor is score inequality ( $BF = 7$ and $BF > 100$ for target densities 5 and 15, respectively). Agents that promote more equal performance between the human and the AI agent are preferred. To further illustrate this effect, Fig. 9 shows the effect of score inequality on human preferences. The results show that agents with human and AI scores that are more similar (i.e., closer to the diagonal lines representing equal scores) tend to be chosen more often. Additionally, there appears to be a bias toward human performance, in the sense that the human outperforming the AI affects preferences less than the AI outperforming the human.

Fig. 9 — Scatter plot of mean relative AI agent score versus the human score in Experiment 1. The results are separated by type of human–AI team and target density. Color is reflective of the probability of the AI type being preferred in pairwise match-ups. The diagonal represents the points of equal scores for the AI agent and the human player. The probability with which an agent is chosen appears inversely related to the distance from the diagonal, suggesting that human players have a preference for score equality. Error bars are indicative of the standard error from the mean

Another key predictor is the number of AI “steals”, with agents scoring higher on this feature being less preferred. However, there is convincing evidence for a non-zero estimate of this effect only in the low target density model ( $BF > 100$ ), perhaps because in low-density settings there is more opportunity for competitive interaction.

For the subjective metrics, in the low target density condition, there is evidence for effects of teaming (Q1), the AI’s ability to understand human intent (Q4), and a similar playing style (Q8) ( $BF > 100$ , $BF = 5$ , and $BF > 100$ , respectively). In high target density, the predictors shift to understanding the AI’s intent (Q3), the AI’s ability to understand human intent (Q4), ease of collaboration (Q6), enjoyment (Q7), and similarity in playing style (Q8) ( $BF = 5.2$ , $BF > 10$ , $BF > 100$ , $BF > 100$ , and $BF > 10$ , respectively).

Predictive Accuracy. Another way to evaluate the model is through its ability to predict people’s preferences. We applied a tenfold cross-validation procedure where 90% of the pairwise choice data was used to train the logistic regression and the remaining 10% of the pairwise choice data was used to assess the accuracy of the model predictions. For the objective metrics, accuracy reached 62% in both target density conditions. For subjective metrics, predictive accuracy was 84% and 81% for low and high target densities, respectively.⁴ These results suggest that subjective ratings are more predictive of choice and capture dimensions not represented in the objective metrics, at least within the set considered in this study.
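The tenfold cross-validation procedure can be sketched in plain Python as below. Note the substitutions: the paper estimates the regression with Bayesian methods in JASP, whereas this sketch fits an ordinary maximum-likelihood logistic regression by gradient descent, and it runs on synthetic feature differences generated inside the example rather than on study data. All function names, the learning rate, and the number of steps are illustrative assumptions.

```python
import math
import random
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows: List[List[float]], labels: List[int],
                 steps: int = 300, lr: float = 0.5) -> List[float]:
    """Gradient descent on the mean log-loss; w[0] is the bias term."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def predict(w: List[float], x: List[float]) -> int:
    """Predict 1 (first agent chosen) when the fitted log-odds are positive."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0


def cross_validated_accuracy(rows: List[List[float]], labels: List[int],
                             k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, pool accuracy over folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w = fit_logistic([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(w, rows[i]) == labels[i] for i in fold)
    return correct / len(rows)


# Synthetic feature differences (X_i - Y_i); only the first one drives choice.
rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
labels = [1 if rng.random() < sigmoid(2.0 * x[0]) else 0 for x in rows]
accuracy = cross_validated_accuracy(rows, labels)
```

Each observation is held out exactly once, so the pooled accuracy is an out-of-sample estimate comparable in spirit to the percentages reported above.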

Trade-off Between Performance and Preference. The results of the choice analysis show that human preferences for AI agents are driven by a number of factors other than the performance of the individual AI agents or human players. A visualization of this misalignment is shown in Fig. 10. While certain agents, like the Ignorant agent, demonstrated high team scores by maximizing interception rates, this approach often led to lower human preference ratings due to competitive interactions that disregarded human intentions. Agents designed to reduce target competition, such as the Divide and Bottom-Feeder agents, were generally preferred by participants, especially under low target density conditions. The Divide agent’s area-based strategy minimized overlaps in target selection, enhancing collaborative ease, while the Bottom-Feeder agent focused on lower-value targets, allowing humans to prioritize high-value intercepts and feel a stronger sense of contribution.

Fig. 10 — Scatter plot of team performance over probability of being chosen, by AI agent condition and target density conditions. Error bars reflect the standard error from the mean. Results are from Experiment 1

Overall, these results show that selecting the best collaborative AI agent depends on the primary criteria for evaluation. If team performance is prioritized, the Ignorant and Omit agents are reasonable choices. However, if human preference is the priority, the Bottom-Feeder and Omit agents would be preferred in low target density conditions, and the Omit and Divide agents in high target density conditions. From a multi-objective optimization perspective, the best collaborative AI agent balances these performance and preference goals.

Experiment 2

The results of Experiment 1 indicated that participants’ choices of collaborative AI agents were influenced more strongly by subjective impressions than by either their own performance or the AI’s performance. One possibility is that certain aspects of the experimental design contributed to this outcome. In Experiment 1, after participants interacted with two agents, they were first presented with a questionnaire assessing the agents on multiple collaborative dimensions, and only afterward were they asked to choose their preferred agent. This ordering may have shaped participants’ subsequent choices by drawing attention to the specific traits emphasized in the questionnaire, thereby amplifying the influence of subjective factors relative to objective performance metrics. Additionally, because monetary compensation was fixed and independent of performance, participants may have been less motivated to prioritize performance-related considerations when selecting an agent.

Experiment 2 was designed as a smaller replication with two changes to specifically address these potential limitations. First, we manipulated the order in which participants completed the preference choice and the questionnaire. This allowed us to test whether the act of rating an agent on specific traits before making a choice alters the relative weight of subjective and performance-based factors. Second, we introduced a performance-based incentive: participants received a $2 bonus if their team’s cumulative score across the four game rounds ranked in the top 50% of participants in the same experimental condition. This bonus was intended to increase the salience of performance outcomes and test whether financial incentives would shift preferences toward higher-performing agents.