Publication
Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages
Timo P. Gros; Nicola Müller; Daniel Höller; Verena Wolf
In: Workshop on Reliable Data-Driven Planning and Scheduling. International Conference on Automated Planning and Scheduling (ICAPS-2024), Springer, 2024.
Abstract
Deep reinforcement learning (DRL) has succeeded tremendously in many complex decision-making tasks. However, for many real-world applications, standard DRL training results in agents with brittle performance because, in particular for safety-critical problems, the discovery of strategies that are both safe and successful is very challenging. Various exploration strategies have been proposed to address this problem. However, they do not take information about the current safety performance into account; thus, they fail to systematically focus on the parts of the state space most relevant for training. Here, we propose regret and state restoration in evaluation-based deep reinforcement learning (RARE), a framework that introduces two innovations: (i) it combines safety evaluation stages with state restorations, i.e., restarting episodes in formerly visited states, and (ii) it exploits estimations of the regret, i.e., the gap between the policy's current and optimal performance. We show that both innovations are beneficial and that RARE outperforms baselines such as deep Q-learning and Go-Explore in an empirical evaluation.
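The following is a minimal, illustrative Python sketch of the two ideas named in the abstract, regret-weighted state restoration and periodic evaluation stages; it is not the authors' implementation, and all names (RestorationBuffer, estimate_regret, env.restore_state, etc.) are hypothetical placeholders assumed for this example.

```python
import random

# Illustrative sketch only -- not the RARE implementation from the paper.
class RestorationBuffer:
    """Stores formerly visited states together with a regret estimate,
    i.e., an estimated gap between the current policy's return from that
    state and an (approximated) optimal return."""

    def __init__(self):
        self.entries = []  # list of (state, regret_estimate)

    def add(self, state, regret_estimate):
        self.entries.append((state, regret_estimate))

    def sample_state(self):
        # Prefer states with high estimated regret, so training focuses on
        # the parts of the state space where the policy is still far from
        # optimal and hence most relevant for safe behavior.
        weights = [max(regret, 1e-6) for _, regret in self.entries]
        return random.choices(self.entries, weights=weights, k=1)[0][0]


def training_loop(env, agent, buffer, episodes, eval_interval):
    for episode in range(episodes):
        # Periodic evaluation stage: restart the episode in a formerly
        # visited high-regret state instead of the initial state.
        if episode % eval_interval == 0 and buffer.entries:
            state = env.restore_state(buffer.sample_state())  # assumed API
        else:
            state = env.reset()

        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, done, info = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            # Record the visited state with an estimated regret, e.g.,
            # derived from evaluation rollouts or value-function gaps.
            buffer.add(next_state, agent.estimate_regret(next_state))
            state = next_state
```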