Skip to main content
. 2024 May 10;5(5):100988. doi: 10.1016/j.patter.2024.100988

Table 1.

Overview of examples of AI systems’ learned deception

Manipulation: Meta developed the AI system CICERO to play Diplomacy. Meta’s intentions were to train CICERO to be “largely honest and helpful to its speaking partners.”4 Despite Meta’s efforts, CICERO turned out to be an expert liar. It not only betrayed other players but also engaged in premeditated deception, planning in advance to build a fake alliance with a human player in order to trick that player into leaving themselves undefended for an attack.
Feints: DeepMind created AlphaStar, an AI model trained to master the real-time strategy game Starcraft II.5 AlphaStar exploited the game’s fog-of-war mechanics to feint: to pretend to move its troops in one direction while secretly planning an alternative attack.6
Bluffs: Pluribus, a poker-playing model created by Meta, successfully bluffed human players into folding.7
Negotiation: AI systems trained to negotiate in economic transactions learned to misrepresent their true preferences in order to gain the upper hand in both Lewis et al.8 and Schulz et al.9
Cheating the safety test: AI agents learned to play dead, in order to avoid being detected by a safety test designed to eliminate faster-replicating variants of the AI.10
Deceiving the human reviewer: AI systems trained on human feedback learned to behave in ways that earned positive scores from human reviewers by tricking the reviewer about whether the intended goal had been accomplished.11

In each of these examples, an AI system learned to deceive in order to increase its performance at a specific type of game or task.