Publication
Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents
Quentin Delfosse; Sebastian Sztwiertnia; Mark Rothermel; Wolfgang Stammer; Kristian Kersting
In: Amir Globerson; Lester Mackey; Danielle Belgrave; Angela Fan; Ulrich Paquet; Jakub M. Tomczak; Cheng Zhang (Eds.). Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024. 2024.
Abstract
Goal misalignment, reward sparsity and difficult credit assignment are only a few of
the many issues that make it difficult for deep reinforcement learning (RL) agents to
learn optimal policies. Unfortunately, the black-box nature of deep neural networks
impedes the inclusion of domain experts for inspecting the model and revising
suboptimal policies. To this end, we introduce Successive Concept Bottleneck
Agents (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In
contrast to current CB models, SCoBots do not just represent concepts as properties
of individual objects, but also as relations between objects, which is crucial for many
RL tasks. Our experimental results provide evidence of SCoBots’ competitive
performance, but also of their potential for domain experts to understand and
regularize their behavior. Among other things, SCoBots enabled us to identify a
previously unknown misalignment problem in the iconic video game, Pong, and
resolve it. Overall, SCoBots thus result in more human-aligned RL agents.
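To make the architectural idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a policy whose final layer acts only on an interpretable bottleneck: per-object property concepts followed by relational concepts between object pairs. All module names, concept choices (positions, pairwise distances), and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a successive concept bottleneck policy:
# object features -> per-object properties -> pairwise relations -> action logits.
import torch
import torch.nn as nn


class RelationalConceptBottleneckPolicy(nn.Module):
    def __init__(self, num_objects: int, obj_feat_dim: int, num_actions: int):
        super().__init__()
        self.num_objects = num_objects
        # First bottleneck: per-object property concepts (e.g. x/y position, velocity).
        self.property_head = nn.Linear(obj_feat_dim, 4)
        # One relational concept (distance) per unordered object pair.
        num_pairs = num_objects * (num_objects - 1) // 2
        # Final layer sees only interpretable concepts (properties + relations).
        concept_dim = num_objects * 4 + num_pairs
        self.policy_head = nn.Linear(concept_dim, num_actions)

    def forward(self, object_features: torch.Tensor) -> torch.Tensor:
        # object_features: (batch, num_objects, obj_feat_dim)
        props = self.property_head(object_features)   # (batch, N, 4)
        positions = props[..., :2]                    # interpret first two dims as (x, y)
        # Second bottleneck: pairwise relational concepts (Euclidean distances).
        relations = []
        for i in range(self.num_objects):
            for j in range(i + 1, self.num_objects):
                relations.append(torch.norm(positions[:, i] - positions[:, j], dim=-1))
        relations = torch.stack(relations, dim=-1)    # (batch, num_pairs)
        concepts = torch.cat([props.flatten(1), relations], dim=-1)
        return self.policy_head(concepts)             # action logits


# Example usage with toy dimensions.
policy = RelationalConceptBottleneckPolicy(num_objects=3, obj_feat_dim=8, num_actions=6)
logits = policy(torch.randn(2, 3, 8))
print(logits.shape)  # torch.Size([2, 6])
```

Because the policy head receives only named concepts, a domain expert could in principle inspect or constrain individual concepts (e.g. a particular distance) to revise a misaligned policy, which is the kind of intervention the abstract describes for Pong.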
