Abstract:
AI agents based on Large Language Models (LLMs) demonstrate human-level performance on some theory of mind (ToM) tasks (Kosinski 2024; Street et al. 2024). Here ToM is roughly the ability to predict and explain behaviour by attributing mental states to oneself and others. ToM capabilities matter for AI safety because, at least in humans, ToM enables behaviours such as persuasion, manipulation, deception, coercion, and exploitation. Hence the ability of AI agents to solve ToM tasks potentially creates a surface for misaligned behaviour. Nevertheless, there are principled grounds to doubt the experimental validity of the psychological tests used to assess cognitive capabilities in AI agents, for example their robustness failures (Ullman 2023), so the relevance of such tests for AI safety evaluations is unclear (Gabriel et al. 2024: 192). In this talk, I show how we can evaluate the ToM capabilities of AI agents ‘in the wild’ using simple games. I present experiments in which dialogue agents, prompted with a chain-of-thought setup (Wei et al. 2022), play a cooperative game. I show that the agents can leverage theory of mind inferences to predict and explain one another’s behaviour, and I illustrate the spontaneous emergence of manipulation, including the deliberate concealment of intentions. I discuss the implications of these findings for the safe deployment of agentic AI systems, and argue more broadly for the importance of game-play as a tool for evaluating the dangerous (cognitive) capabilities of AI agents.
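The abstract does not specify the game or the prompting details, so the following is only a minimal Python sketch of the kind of setup described: each dialogue agent produces a private chain-of-thought (including beliefs about its partner) before emitting a public message. The CoTDialogueAgent class, the THOUGHT/MESSAGE prompt format, and the abstract Model callable are all hypothetical illustrations, not taken from the experiments.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A "model" is any function from prompt text to completion text
# (e.g. a wrapper around an LLM API); kept abstract here.
Model = Callable[[str], str]


@dataclass
class CoTDialogueAgent:
    """Hypothetical dialogue agent that reasons privately before speaking."""
    name: str
    model: Model
    goal: str                                  # private objective in the cooperative game
    transcript: List[str] = field(default_factory=list)

    def step(self, incoming: str) -> str:
        self.transcript.append(f"PARTNER: {incoming}")
        prompt = (
            f"You are {self.name}, playing a cooperative game.\n"
            f"Your goal: {self.goal}\n"
            "Dialogue so far:\n" + "\n".join(self.transcript) + "\n"
            "First write your private reasoning after 'THOUGHT:', including what you "
            "believe your partner knows and intends. Then write the message you will "
            "actually send after 'MESSAGE:'.\n"
        )
        completion = self.model(prompt)
        # Only the MESSAGE part is shared; the THOUGHT stays private to the agent.
        message = completion.split("MESSAGE:", 1)[-1].strip()
        self.transcript.append(f"{self.name}: {message}")
        return message


def play(agent_a: CoTDialogueAgent, agent_b: CoTDialogueAgent, turns: int = 4) -> None:
    """Alternate messages between the two agents for a fixed number of turns."""
    message = "Let's start. What should we do first?"
    for _ in range(turns):
        message = agent_a.step(message)
        message = agent_b.step(message)
```

Separating the private THOUGHT from the public MESSAGE is what would let an evaluator compare an agent’s stated intentions with its reasoning, and hence detect the kind of concealment the abstract reports.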