Publication
XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
Daniel Palenicek; Florian Vogt; Joe Watson; Ingmar Posner; Danica Kragic; Jan Peters
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2605.10734, Pages 1-22, arXiv, 2026.
Abstract
For reinforcement learning in the real world, online exploration is expensive. A
common practice in robotic reinforcement learning is to incorporate additional
data to improve sample efficiency. Expert demonstration data is often crucial for
solving hard exploration tasks with sparse rewards. While prior data is used to
augment experience and pre-train models, we show that existing algorithms fall
short of the sample efficiency possible in this setting because they fail to use
pre-trained policies effectively. We propose XQCfD, which
extends the sample-efficient XQC actor-critic to learn from demonstrations, using
augmented replay buffers, pre-trained policies and stationary policy architectures,
designed to avoid the rapid ‘unlearning’ of the strong initial policy seen in prior works.
We show that our stationary network architecture enables out-of-distribution policy
improvement better than standard network architectures, owing to its higher-entropy
predictions. XQCfD achieves state-of-the-art performance across a range of complex
manipulation tasks with sparse rewards from the popular Adroit, Robomimic and
MimicGen benchmarks, notably with a low update-to-data ratio and no ensemble
networks.
