Publication
Time-Efficient Reinforcement Learning with Stochastic Stateful Policies
F. Al-Hafez; G. Zhao; Jan Peters; D. Tateo
In: 17th European Workshop on Reinforcement Learning (EWRL 2024), EWRL, 2024.
Abstract
Stateful policies play an important role in reinforcement learning, for example in handling
partially observable environments, enhancing robustness, or imposing an inductive bias directly on the policy structure. The conventional method for training
stateful policies is Backpropagation Through Time (BPTT), which comes with
significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often
truncated to address these issues, resulting in a biased policy update. We present
a novel approach for training stateful policies that decomposes them into a
stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful
policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we
provide a theoretical analysis of our new gradient estimator and compare it with
BPTT. We evaluate our approach on complex continuous control tasks, such as humanoid locomotion, and demonstrate that our gradient estimator scales effectively
with task complexity while offering a faster and simpler alternative to BPTT.
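
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear-Gaussian joint distribution over the action and the next internal state, a fixed list of observations, and placeholder returns, and all names (W, sigma, act_dim, rollout_gradient) are hypothetical. The point it shows is that once the next internal state is sampled rather than computed deterministically, each step contributes a score-function term that depends only on that single transition, so no gradient is propagated backward through time.

```python
import numpy as np

def joint_mean(W, s, z):
    # Mean of the joint Gaussian over (action, next internal state).
    return W @ np.concatenate([s, z])

def sample_step(W, s, z, sigma, rng):
    # Draw (a, z_next) jointly; z_next is sampled, not computed deterministically.
    mu = joint_mean(W, s, z)
    return mu + sigma * rng.standard_normal(mu.shape)

def log_prob(W, s, z, y, sigma):
    # Log-density of the isotropic Gaussian N(y; W [s; z], sigma^2 I).
    d = y.size
    sq = np.sum((y - joint_mean(W, s, z)) ** 2)
    return -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)

def score(W, s, z, y, sigma):
    # Gradient of log_prob with respect to W; uses only this one transition.
    x = np.concatenate([s, z])
    return np.outer((y - joint_mean(W, s, z)) / sigma**2, x)

def rollout_gradient(W, states, z0, returns, sigma, act_dim, rng):
    # Score-function estimate: sum over t of
    # grad log pi(a_t, z_{t+1} | s_t, z_t) weighted by a return signal G_t.
    # Assumption: states and returns are given; a real agent would obtain
    # them by interacting with an environment.
    grad, z = np.zeros_like(W), z0
    for s, g in zip(states, returns):
        y = sample_step(W, s, z, sigma, rng)
        grad += score(W, s, z, y, sigma) * g
        z = y[act_dim:]  # the sampled internal state enters the next step as plain data
    return grad
```

Because the per-step terms are independent given the sampled internal states, they could also be computed in a batch rather than in a loop; the loop here only generates the rollout.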