Publication
Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control
Siwei Ju; Jan Tauberschmidt; Oleg Arenz; Peter van Vliet; Jan Peters
In: Computing Research Repository (CoRR), Vol. abs/2604.03023, Pages 1-15, arXiv, 2026.
Abstract
Learning high-performance control policies that remain consistent with expert behavior is a fundamental challenge in robotics. Reinforcement learning can discover high-performing strategies but often departs from desirable human behavior, whereas imitation learning is limited by demonstration quality and struggles to improve beyond expert data. This challenge is particularly pronounced in high-performance dynamic systems, where desirable behavior is expressed over trajectories and the consequences of suboptimal decisions often emerge only after a delay. We propose a behavior-constrained reinforcement learning framework that improves beyond demonstrations while explicitly controlling deviation from expert behavior. Because expert-consistent behavior in dynamic control is inherently trajectory-level, we introduce a receding-horizon predictive mechanism that models short-term future trajectories and provides look-ahead rewards during training. To account for the natural variability of human behavior under disturbances and changing conditions, we further condition the policy on reference trajectories, allowing it to represent a distribution of expert-consistent behaviors rather than a single deterministic target. Empirically, we evaluate the approach in a high-fidelity race car simulation using data from professional drivers, a domain characterized by extreme dynamics and narrow performance margins. The learned policies achieve competitive lap times while maintaining close alignment with expert driving behavior, outperforming baseline methods in both performance and imitation quality. Beyond standard benchmarks, we conduct a human-grounded evaluation in a driver-in-the-loop simulator and show that the learned policies reproduce setup-dependent driving characteristics consistent with the feedback of top-class professional race drivers. These results demonstrate that our method enables learning high-performance control policies that are both optimal and behavior-consistent, and that can serve as reliable surrogates for human decision-making in complex control systems.
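The core idea the abstract describes — combining a task reward with a receding-horizon penalty for deviating from an expert reference trajectory — can be sketched in a few lines. This is an illustrative assumption of how such reward shaping might look, not the paper's actual formulation: the function name, the Euclidean deviation measure, and the `beta`/`gamma` weights are all hypothetical.

```python
import numpy as np

def lookahead_shaped_reward(r_task, predicted_traj, reference_traj,
                            beta=0.1, gamma=0.95):
    """Blend a task reward with a short-horizon behavior penalty (illustrative sketch).

    r_task         : scalar environment reward (e.g. progress along the track)
    predicted_traj : (H, d) array, short-horizon rollout predicted under the policy
    reference_traj : (H, d) array, expert reference over the same horizon
    beta           : hypothetical trade-off between performance and expert consistency
    gamma          : discount weighting near-term deviations more than distant ones
    """
    H = predicted_traj.shape[0]
    discounts = gamma ** np.arange(H)                       # near-term deviations count more
    deviations = np.linalg.norm(predicted_traj - reference_traj, axis=1)
    # Discount-weighted average deviation over the look-ahead horizon
    behavior_penalty = float(np.sum(discounts * deviations) / np.sum(discounts))
    return r_task - beta * behavior_penalty

# Example: a rollout that matches the reference exactly incurs no penalty,
# so the shaped reward equals the raw task reward.
ref = np.zeros((5, 2))
print(lookahead_shaped_reward(1.0, ref, ref))
```

In this sketch, raising `beta` trades lap-time performance for closer adherence to the expert line, which mirrors the "explicitly controlling deviation from expert behavior" knob the abstract refers to.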
