Evaluating AI Agents for Dangerous (Cognitive) Capabilities
Abstract: AI agents based on Large Language Models (LLMs) demonstrate human-level performance at some theory of mind (ToM) tasks (Kosinski 2024; Street et al. 2024). Here ToM is roughly the ability to predict and explain behaviour by attributing mental states to oneself and others. ToM capabilities matter for AI safety because, at least in humans, […]