Publication
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
Anish Abhijit Diwan; Davide Tateo; Christopher E. Mower; Haitham Bou-Ammar; Jan Peters; Oleg Arenz
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2605.11020, Pages 1-24, arXiv, 2026.
Abstract
Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL: one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
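
For orientation, the classical dual-ascent formulation that the abstract refers to can be sketched as follows. This is the standard maximum-entropy IRL setup written under assumptions that are not taken from the paper: a reward linear in a feature map $\phi$, i.e. $r_\theta = \theta^\top \phi$, a step size $\alpha$, and the symbols $\pi_E$ (expert policy) and $\pi^*_\theta$ (entropy-regularized optimal policy for $r_\theta$). The paper's own notation and reward parameterization may differ.

```latex
% Primal: maximize entropy subject to matching expert feature expectations
\max_{\pi} \; H(\pi)
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\phi(s,a)\right] = \mathbb{E}_{\pi_E}\!\left[\phi(s,a)\right]

% Dual objective (concave in \theta), with assumed reward r_\theta = \theta^\top \phi
L(\theta) = \theta^\top \mathbb{E}_{\pi_E}\!\left[\phi\right]
          - \max_{\pi}\left( H(\pi) + \theta^\top \mathbb{E}_{\pi}\!\left[\phi\right] \right)

% Dual gradient: needs \pi^*_\theta, the solution of the inner RL problem
\nabla_\theta L(\theta) = \mathbb{E}_{\pi_E}\!\left[\phi\right]
                        - \mathbb{E}_{\pi^*_\theta}\!\left[\phi\right]

% Dual ascent step with assumed step size \alpha > 0
\theta \leftarrow \theta + \alpha \, \nabla_\theta L(\theta)
```

The inner maximization over $\pi$ is what makes each classical iteration as expensive as solving a full RL problem; according to the abstract, TRIRL replaces it with a trust-region search around the current policy combined with a correspondingly smaller reward update.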
