Abstract:
Identifying the presence of consciousness is difficult even in humans when one does not want to rely on subjective reports [1,2]; it is more difficult still in animals [3,4] and, due to the lack of a common evolutionary history, even more so in artificial systems [5,6]. Currently, there are two major alternatives on the table: a theory-heavy approach [7] that relies on what our best theories of consciousness say [8-10], and a theory-light approach that tries to identify theory-independent behavioural markers of consciousness [7,11-17]. The theory-heavy approach derives indicators of consciousness from theories with meaningful empirical support [8] and holds that one's confidence that a given system is conscious is determined by the similarity between the indicators and the computational processes the system implements, together with one's credence in the theories from which the indicators are derived (plus one's level of commitment to computational functionalism). In contrast, the theory-light approach aims to rely purely on correlational evidence linking certain behavioural responses to the presence of consciousness in humans, and uses these behaviours as markers of consciousness in non-human cases [3,4,7,11,13,15,18,19]. This approach is driven by the so-called facilitation hypothesis, which claims that what makes stimulus-specific behavioural responses markers of consciousness is that, in healthy humans, the conscious perception of the relevant stimuli, relative to the unconscious processing of the same stimuli, causes or facilitates the responses in question [7,15,20].
There are well-known problems with both approaches. On the one hand, it is unknown how far one can abstract away from the complex human cognitive architecture before the conditions for consciousness offered by human-centric theories lose applicability [7,10,15,21-23]. This makes it problematic to assign credence on the basis of loose similarities between indicators and the computational processes a system implements. On the other hand, behaviour-based tests of consciousness can be 'gamed' [8,17,24,25], that is, passed by 'gerrymandered' systems [26] designed specifically to pass the test without implementing any capacities relevant to consciousness, which renders the applicability of the theory-light approach to machine consciousness problematic (although see [17] for a counter-argument).
The present paper contributes to this debate. The first, negative part presents novel challenges for both approaches. It argues that many of the theoretical indicators of consciousness proposed so far [8,10,23] are derived from auxiliary hypotheses, i.e. contingent links between a theory's central assumptions about what consciousness is [27] and empirical, implementation-specific details. This renders the indicator properties ad hoc, as such criteria are often revised when empirical predictions turn out to be false [27-31]. Moreover, credence-based approaches [8] seem ill-suited to settling disputed cases. The scientific study of consciousness struggles with an abundance of competing theories that resist both elimination and convergence [32-38]. Different groups of scientists assign different credences to different theories, so it is unclear whose credences should matter when an artificial system implements only some of the indicators. The paper also argues that, contrary to the explicit claims of its proponents [7], the facilitation hypothesis, and thus the behavioural-marker approach in general, is not compatible with most theories of consciousness but instead carries a strong theoretical bias. Consciousness and the behavioural responses in question might be products of independent mechanisms that are activated by a common cause in the human case, and thus can come apart in non-human cases. One of the most popular theories of consciousness, the so-called higher-order approach [39-41], relies on a cognitive architecture that implements exactly this common-cause structure. As a result, the facilitation hypothesis might produce many false positives, predicting the presence of consciousness in cases where, according to higher-order theories, consciousness is absent. Moreover, so-called local recurrence theories of consciousness [32,42,43] are also incompatible with the cognitive architecture the facilitation hypothesis assumes, which raises the possibility of many false negatives, i.e. predictions of no consciousness in cases where consciousness might nevertheless be present. The negative part of the paper concludes that such theoretical biases undermine the very purpose of the facilitation hypothesis as the foundation of a theory-light approach.
The positive part of the paper argues that, in light of these challenges, the best strategy is to shift the focus from behavioural outcomes to the cognitive processes (internal representations and the operations over them) that produce the behaviour in question. Since relying on behavioural markers cannot bypass theoretical commitments, the internal mechanisms producing the proposed behavioural markers need to be analysed in order to determine their theoretical biases. Drawing on what cognitive neuroscience reveals about the internal representations and information processing that produce the behavioural responses in question, one can establish which theories of consciousness the different behavioural markers are compatible with. At the same time, this very information about representational structure and information flow can serve as the basis of comparison against which artificial systems can be checked. The paper demonstrates how recent methodologies from state-of-the-art AI interpretability research (classifier probes [44], representation engineering [45] and mechanistic interpretability [46,47]) could be used to determine whether such representational markers of consciousness can be found in artificial systems [48]. Utilising the mechanistic explanatory framework [49-52], the paper offers a systematic discussion of how stimulus-response-focused behavioural analysis, representational descriptions and the organised activities of sub-systems relate to each other, clarifying how abstract computational characteristics are related to implementation-specific details. The paper also argues that this representational-marker strategy is far more resistant to the 'gaming problem' [8,17,24,26], since 'gaming' a test by reproducing internal operations in deep neural networks is significantly more difficult than reproducing behaviour. The paper concludes by comparing the representational-marker approach to recent proposals that also emphasise representational/mechanistic similarity [53,54], and by clarifying that, within this framework, the problems of how far one can abstract away from the complex human cognitive architecture [7,10,15,21-23] and of whether consciousness is a biological or a computational phenomenon [55] translate into the challenge of determining how 'deep', in terms of levels of mechanisms, the similarity between humans and artificial systems should be.
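To make the first of the cited interpretability methods concrete, the sketch below illustrates what a classifier probe [44] involves in practice: training a simple linear classifier to decode a property of interest from a network's hidden-layer activations. It is a minimal illustration under stated assumptions, not the paper's actual pipeline; the arrays `activations` and `labels` are hypothetical placeholders for activations recorded from a real model and for a stimulus property of interest.

    # Minimal sketch of a classifier (linear) probe.
    # Assumption: `activations` stands in for hidden-layer activations
    # recorded from a network while it processes stimuli; `labels` marks
    # a hypothetical binary property of those stimuli. Above-chance probe
    # accuracy would suggest the property is linearly decodable from,
    # and in that sense explicitly represented in, this layer.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Placeholder data: 1000 stimuli x 512 hidden units (random here,
    # so probe accuracy should hover around chance).
    activations = rng.standard_normal((1000, 512))
    labels = rng.integers(0, 2, size=1000)

    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

The design choice the sketch highlights is the one the paper's argument turns on: the evidence comes from the system's internal representational structure, not from its stimulus-response behaviour, which is what makes such tests harder to 'game' than behavioural ones.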
