Date: November 3, 2023
Title: Existential Risk from the Artificial Will
Speaker: Dr Dmitri Gallow, Dianoia Institute of Philosophy
Chair: Dr Frank Hong, The University of Hong Kong
Abstract:
I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown-button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown-button, and (3) otherwise pursue goals competently. I present a simple theorem that formalizes the problem and note that this theorem can guide our search for solutions: if an agent is to be shutdownable, it must violate at least one of the theorem’s conditions. So we should examine the conditions one by one, asking (first) whether it’s feasible to design an agent that violates the condition and (second) whether violating the condition could help keep the agent shutdownable. I argue that Completeness seems promising as a condition to violate. More precisely, I argue that we should train agents to have a preferential gap between every pair of different-length trajectories. I argue that these preferential gaps – plus adherence to a principle that I call ‘Timestep Dominance’ – would keep agents shutdownable. I end by explaining how we could train reinforcement learning agents to abide by the requisite principles.
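The two principles named in the abstract can be given a toy gloss in code. This is only an illustrative sketch, not the speaker's formalism: it assumes a trajectory is summarized by a (length, utility) pair, and it assumes the usual dominance-style reading of Timestep Dominance, on which one option timestep-dominates another if it is at least as good conditional on every possible shutdown timestep and strictly better conditional on at least one. All names below are hypothetical.

```python
# Toy illustration of the abstract's two principles (not the talk's formalism).
# A "trajectory" is modeled as a (length, utility) pair; these names are
# illustrative assumptions, not definitions from the talk.

def prefers(a, b):
    """Strict preference holds only between same-length trajectories.
    Between any pair of different-length trajectories there is a
    preferential gap: neither is preferred to the other."""
    len_a, util_a = a
    len_b, util_b = b
    if len_a != len_b:
        return False  # preferential gap across different lengths
    return util_a > util_b

def timestep_dominates(x, y):
    """x and y map each possible shutdown timestep to an expected utility
    conditional on shutdown occurring at that timestep. x timestep-dominates
    y if x is at least as good at every timestep and strictly better at
    at least one (assumed gloss of 'Timestep Dominance')."""
    at_least_as_good = all(x[t] >= y[t] for t in x)
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better
```

On this gloss, an agent abiding by Timestep Dominance never accepts a cost conditional on some shutdown time merely to shift when shutdown happens, which is the intuitive reason such an agent lacks incentives to prevent or cause the pressing of the shutdown-button.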